Productionalizing data science through MLOps

Article by: Alex Johnson

Your data science team has collected thorough requirements for a pressing business problem, collected and cleaned all the required data, and fit an effective, validated model that is ready to provide real value to real business users.

Mission accomplished, right?

Not so fast. First off, you need to figure out how to get the model into the hands of the business users and how to assure the quality of its performance over the long term. Your team needs to consider the following:

• “How will my model perform on new, unseen data?”

• “Is my model training process reproducible?”

• “Can my model scale to the needs of the business?”

• “If changes are required, how long will it take to make those changes in production?”

Turns out, this is a difficult problem, and one that is not improving over time. Rexer Analytics, which regularly surveys the data science industry, reported in its 2017 survey that only 13 percent of data scientists say their models always get deployed, a figure that has not improved since 2009, when the question first appeared on the survey.

One contributor to this issue is that machine learning production systems have a variety of moving parts outside of pure modeling code. These include data collection and processing code, environment configuration, process management code, and monitoring code, to name a few. This complexity creates opportunities for technical debt and can increase lead time for both changes and deployment. Algorithmia reports that, among companies that deploy models, half of respondents said they spend between 8 and 90 days deploying a single model, and 18 percent said it takes longer than 90 days. ML practitioners also cite scaling up their models and versioning/reproducibility as the two largest challenges their organizations face.

How can this be improved? By borrowing aspects of DevOps. There has been a recent rise in Machine Learning Operations (MLOps), a set of guiding principles focused on automation, collaboration, reproducibility, monitoring, and effective model scaling. The overall goal is ensuring quality of machine learning systems over time and reducing lead time for moving a model into production.

Automation (CI/CD)

Training and deploying a model is a multi-step process that can easily fall into the trap of being treated as a one-off task. Instead, training and deployment should be contained in an automated pipeline that can be triggered not only after code changes, but also to retrain on new data, either periodically or when a drift metric computed on recent data exceeds an established threshold (more on drift metrics below).
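The three triggers described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `PipelineState` fields, the 30-day schedule, and the 0.2 drift threshold are hypothetical placeholders, not values from any particular CI/CD tool.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical snapshot of what the pipeline knows at check time."""
    code_changed: bool             # a new commit touched training code or config
    days_since_last_training: int  # age of the currently deployed model
    drift_score: float             # drift metric computed on recent data


def retrain_reasons(state, max_age_days=30, drift_threshold=0.2):
    """Return the list of reasons to trigger the training pipeline.

    An empty list means no retraining is needed. The default limits are
    illustrative assumptions and would be tuned per model.
    """
    reasons = []
    if state.code_changed:
        reasons.append("code_change")
    if state.days_since_last_training >= max_age_days:
        reasons.append("schedule")
    if state.drift_score > drift_threshold:
        reasons.append("drift")
    return reasons


if __name__ == "__main__":
    state = PipelineState(code_changed=False,
                          days_since_last_training=12,
                          drift_score=0.35)
    print(retrain_reasons(state))
```

Returning the reasons rather than a bare boolean keeps an audit trail of why each pipeline run was started, which helps when reviewing how often drift, rather than the calendar, drives retraining.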

By introducing CI/CD and operationalizing model training and monitoring, MLOps provides repeatable, consistent mechanisms for moving models to the target environment, and it opens additional opportunities to incorporate automated integration testing and parity across multiple environments.

Reproducibility — Data and Model Versioning

It’s generally an expectation in software development environments to use a version control system such as Git or SVN for tracking code and configuration artifacts. In data science workflows, there’s the added complexity of needing to create reproducible results (it is data science, after all). This includes being able to retrieve a previous version of the training and test datasets to reproduce a model’s results.
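One lightweight way to tie a model to the exact data it was trained on is to record a content hash of each dataset next to the code version. The sketch below uses only the Python standard library; the function names and the manifest layout are assumptions made for illustration, not the format of any specific versioning tool.

```python
import hashlib


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def training_manifest(code_version, train_path, test_path):
    """Bundle the code version with fingerprints of the data it was run on.

    `code_version` is whatever identifier the team uses, for example a
    Git commit hash (passed in by the caller, not looked up here).
    """
    return {
        "code_version": code_version,
        "train_sha256": dataset_fingerprint(train_path),
        "test_sha256": dataset_fingerprint(test_path),
    }
```

Storing this manifest alongside the trained model makes it possible to check later whether a rerun used byte-identical data, since any change to either file changes its fingerprint.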

Data versioning

With the advent of cheap, persistent storage (such as AWS S3), it becomes trivial to version small to medium datasets with modern tools such as Data Version Control (DVC), Metaflow, and SageMaker. In some workflows, such as SageMaker, this is done automatically alongside model training. For big data problems, where storing a full copy of every split is impractical, a common design pattern is to make the split itself repeatable by hashing a stable field of each record instead of generating random values, so that the same record always lands in the same split.
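A minimal sketch of hash-based splitting follows. It assumes each record has a stable identifier (the `record_key` argument, a hypothetical name) and uses MD5 purely as a deterministic bucketing function, not for security. The 20 percent test fraction and 100 buckets are illustrative defaults.

```python
import hashlib


def assign_split(record_key, test_fraction=0.2, buckets=100):
    """Deterministically assign a record to "train" or "test".

    The key is hashed, the hash is reduced to a bucket number, and the
    lowest buckets are reserved for the test set. The same key always
    yields the same answer, on any machine and in any run.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return "test" if bucket < test_fraction * buckets else "train"


if __name__ == "__main__":
    for key in ["customer-1", "customer-2", "customer-3"]:
        print(key, assign_split(key))
```

Because the assignment depends only on the key, new records can be appended to the dataset without moving existing records between train and test, which a fresh random shuffle cannot guarantee.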

The cultural impact of making the data assets used for model generation shareable and discoverable across a team is invaluable. It not only encourages an emphasis on reproducibility, but also lets the team reuse existing work rather than reinvent the wheel when developing new models.

Want to read the full article? Alex continues through model versioning, data drift, scalability and more:


Enabling clarity through business and technology solutions.
