Versioning, transparency & monitoring in machine learning pipelines

In a nutshell, MLOps provides organizations with a way to deploy and operate machine learning (ML) models 24/7, while ensuring that particular standards and criteria for ML model performance and governance are met. One could say that, from a business perspective, MLOps is the bridge between the experimentation world and the production world: building MLOps means that you’re going beyond AI pilots and proofs of concept into a more mature and operationalized way of working with AI. To put it simply, MLOps is the set of processes that place machine learning and data science into the company’s operational environment.

Whether your organization is ready to start implementing MLOps or is already working on it, it’s important to understand that MLOps goes beyond just training the models. If you want to learn more about the business perspective of MLOps, I suggest you read my colleague’s article on the topic. If you’re more on the technical side, however, in this article I’ll share some principles and learnings in MLOps, focusing on versioning, transparency, and monitoring of machine learning pipelines. I’ll first talk about the importance of versioning and share key learnings and ways of working with versioning that I’ve used in my work. Then I’ll focus on monitoring, from both the theoretical and the practical side. After that, I’ll discuss transparency and how it enables the rest of the organization to work seamlessly with ML models.

The importance of versioning in MLOps

Versioning ML models and pipelines goes beyond the audit and compliance demands that machine learning solutions are subject to. Versioning models and data also enables better monitoring of the models. In addition, versioning accelerates how frequently models can be updated and placed in production.

With proper versioning, we can link model predictions and the corresponding input data to model versions and training data. With this kind of grouping, we can eventually detect data drift and model underperformance. On the implementation side, once we have set up the right versioning components and deployment scripts, we can periodically run (batch) jobs that parse our predictions and analyze their quality.
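
As a concrete illustration, here is a minimal sketch in plain Python of what binding each prediction to its model and data versions can look like. The record schema and field names are hypothetical, not a specific tool's API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical record schema: every prediction is stored together with the
# versions of the model and training data that produced it, so later batch
# jobs can group predictions by version and analyze their quality.
@dataclass
class PredictionRecord:
    model_version: str   # e.g. a model-registry version tag
    data_version: str    # version of the training data set
    features: dict       # the input the model saw
    prediction: float
    timestamp: str

def log_prediction(store: list, model_version: str, data_version: str,
                   features: dict, prediction: float) -> PredictionRecord:
    record = PredictionRecord(
        model_version=model_version,
        data_version=data_version,
        features=features,
        prediction=prediction,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    store.append(record)
    return record

# A periodic batch job can then slice the prediction log per model version:
store = []
log_prediction(store, "model-v3", "data-2024-01", {"age": 41}, 0.82)
log_prediction(store, "model-v2", "data-2023-11", {"age": 35}, 0.64)
v3_records = [r for r in store if r.model_version == "model-v3"]
```

In a real pipeline the store would be a database or data lake table rather than an in-memory list, but the binding between prediction, input, and versions is the essential part.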

In addition, with the right versioning of models, an organization can run multiple models in production, deploy models in shadow mode, and update or replace models in a straightforward way.

Finally, the importance of versioning is reflected in the rise of new tools and their quick adoption by the community. Every cloud vendor includes a model registry and versioning components in their pipeline steps before deploying models. Open-source tools such as MLflow and DVC – both introduced relatively recently – have been widely adopted by the machine learning community exactly because they help with versioning.

Versioning in ML solutions: From traditional software development practices to efficient MLOps

Machine learning is different from traditional software development. Although it involves a lot of software engineering, the level of sophistication in machine learning systems is much higher when it comes to data utilization and monitoring aspects. This difference increases the scope of versioning in ML pipelines: With machine learning systems, we have to bind model training and preprocessing code together with the data that was used. In addition, we need a way to connect the predictions in production and the corresponding input data with the versioned models and training data sets.
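
One lightweight way to create such bindings, sketched below with purely illustrative data, is to derive deterministic version tags by hashing the training data and the preprocessing configuration, then storing those tags alongside the trained model:

```python
import hashlib
import json

# Illustrative sketch (not a specific tool's API): derive deterministic
# version tags from the training data and preprocessing config, so a
# trained model can be bound to exactly what produced it.
def version_tag(obj) -> str:
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

# Hypothetical training data and preprocessing configuration:
training_data = [{"age": 41, "label": 1}, {"age": 35, "label": 0}]
preprocessing = {"scaler": "standard", "impute": "median"}

data_version = version_tag(training_data)
model_lineage = {
    "data_version": data_version,
    "preprocessing_version": version_tag(preprocessing),
}
```

Because the tags are content-derived, retraining on identical data yields the same tag, and any change in the data or configuration is immediately visible in the lineage record.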

Having said the above, I believe going beyond traditional software ways of working and tools is needed to establish proper versioning. I learned this fairly quickly during my first years in MLOps (read about my work at one of the biggest financial institutions in Sweden). To anyone starting to build their machine learning operations, I would say that trying to solve the operationalization of multiple ML models with the existing data and analytics architecture, versioning components, and CI/CD tools is not how MLOps should be run. Without discarding prevalent software development and deployment methods, when operationalizing ML the entire architecture should be extended with components such as a feature store, a model registry, and ML monitoring.

[Image: MLOps: Different areas and workflows.]

To be more specific, I would suggest that you try solutions such as MLflow or native cloud model registry components. These tools provide a straightforward way to save your models and experiments through easy-to-use APIs, and they enable better visibility into the status of a model and its versions (for example, which versions are in staging or in production). I would also suggest introducing a feature store – both offline and online. A feature store is a storage solution that addresses the challenges of serving features to models both during training and in production. One of the challenges feature stores address is the need to keep track of which features the models used in a particular training session. This type of time-travel querying enables better versioning of data in the training and production environments. There are other benefits too, but this is the most important one for the purpose of this blog article.
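
To make the time-travel idea concrete, here is a minimal, illustrative sketch of point-in-time feature retrieval. Real feature stores (Feast, for example) expose this as point-in-time joins; the toy class below only captures the core mechanic:

```python
from bisect import bisect_right

# Minimal sketch of "time-travel" feature retrieval: for each entity we keep
# timestamped feature values, and a query returns the value that was valid
# at a given point in time -- so a training set can be rebuilt exactly as
# the model originally saw it.
class FeatureStore:
    def __init__(self):
        # (entity_id, feature) -> sorted list of (timestamp, value)
        self._rows = {}

    def write(self, entity_id, feature, ts, value):
        history = self._rows.setdefault((entity_id, feature), [])
        history.append((ts, value))
        history.sort()

    def get_as_of(self, entity_id, feature, ts):
        history = self._rows.get((entity_id, feature), [])
        # Find the latest value written at or before the requested time.
        idx = bisect_right(history, (ts, float("inf")))
        return history[idx - 1][1] if idx else None

store = FeatureStore()
store.write("user-1", "avg_spend", ts=10, value=100.0)
store.write("user-1", "avg_spend", ts=20, value=150.0)
```

Querying `get_as_of("user-1", "avg_spend", 15)` returns the value that was valid at time 15, not the latest one, which is exactly what reproducible training sets need.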

The introduction of a model registry and feature store will speed up model development significantly and, at the same time, enable greater visibility into which data, models, and resources are used in production.

Benefits of versioning

  • Reduced model development and deployment time
  • Improved transparency: Greater visibility and traceability of how we created our models and which data, parameters, and code we used
  • More flexibility in running multiple models in production
  • Improved ability to monitor ML pipelines
  • Improved compliance

On monitoring of ML models

Monitoring in MLOps is crucial. Before an organization deploys a machine learning model, it should make sure that proper model monitoring mechanisms are in place and that people with the right domain knowledge are following the ML predictions. Releasing models without proper monitoring mechanisms in place will, at a minimum, result in low business value, but there are other risks as well. One example worth paying attention to is a biased ML model, one that is biased towards or against a particular group of people.

[Image: Description of the MLOps process from lab to production.]

Three perspectives: operational, model performance and business

Models should be monitored from three different perspectives that each raise different questions to be answered. These perspectives are:

  • Operational perspective: For example, are the models able to give a prediction within a particular time window and according to particular service-level agreements (SLAs)?
  • Model performance perspective: For example, are the models able to perform correct classifications and predictions, or are they biased?
  • Business perspective: Did the deployment of a model bring the desired business impact, such as an effective employee churn reduction within the company? In the ideal case, MLOps projects would also track business impacts.

In addition to the above, monitoring of ML models should be continuous. For example, the predictions an ML model makes are based on the data it was trained on. However, that data is merely a snapshot of a past business situation. As ML models operate in the present, the situation can change rapidly, and the past may quickly become irrelevant. Consequently, models should be continuously monitored and updated accordingly.
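
As a simple illustration of continuous monitoring, the sketch below flags drift when a production feature's mean shifts too far from the training distribution. The data and threshold are hypothetical, and production setups often use tests such as Kolmogorov-Smirnov or the Population Stability Index instead of this naive mean check:

```python
from statistics import mean, stdev

# Illustrative drift check: compare the mean of a recent production window
# against the training distribution, flagging drift when the shift exceeds
# a threshold measured in training standard deviations.
def drifted(train_values, prod_values, threshold=2.0):
    mu, sigma = mean(train_values), stdev(train_values)
    shift = abs(mean(prod_values) - mu)
    return shift > threshold * sigma

# Hypothetical feature values from training and from two production windows:
train = [10.0, 11.0, 9.0, 10.5, 9.5]
stable = not drifted(train, [10.2, 9.8, 10.1])   # close to training -> no alert
alert = drifted(train, [25.0, 26.0, 24.0])        # far from training -> alert
```

A scheduled batch job running such a check per feature, per model version, is one cheap way to make the "snapshot goes stale" problem visible before it hurts predictions.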

How to set up proper model monitoring mechanisms

In the use cases I have been involved in at Silo AI, we have set up various ways and processes to monitor ML models.

We set monitors on the input data that the model consumes. For example, we check for data schema changes before the ML model gives predictions. In other words, we have built-in alerts and exception-handling mechanisms that are triggered if particular information is missing. In addition, we have set thresholds on the incoming data, with rules that mark predictions as less certain if the input doesn’t fit the usual trend, e.g. when the input data suddenly falls outside a given distribution. These are all ways to monitor the model’s input data.
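
The input checks described above can be sketched as follows; the schema, value ranges, and field names are entirely hypothetical:

```python
# Illustrative input checks before prediction: verify that the expected
# fields arrive with the expected types, and mark predictions as "uncertain"
# when a feature falls outside the range seen in training.
EXPECTED_SCHEMA = {"age": (int, float), "income": (int, float)}
TRAINING_RANGES = {"age": (18, 90), "income": (0, 500_000)}

def validate_schema(payload: dict):
    missing = [f for f in EXPECTED_SCHEMA if f not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for field, types in EXPECTED_SCHEMA.items():
        if not isinstance(payload[field], types):
            raise TypeError(f"{field} has unexpected type {type(payload[field])}")

def certainty_flag(payload: dict) -> str:
    for field, (low, high) in TRAINING_RANGES.items():
        if not (low <= payload[field] <= high):
            return "uncertain"  # outside the distribution seen in training
    return "ok"

validate_schema({"age": 41, "income": 52_000})        # passes silently
flag = certainty_flag({"age": 120, "income": 52_000})  # age outside range
```

In practice the schema and ranges would be derived from the versioned training data set rather than hard-coded, which is one more place where versioning and monitoring reinforce each other.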

In addition, we create automated batch monitoring jobs that analyze the predictions our model has made. We also provide maintenance instructions, such as performance baselines and important metrics, to the team responsible for monitoring tasks. Other monitoring may focus on analyzing the relationship between the input data, the predictions made, and the training data sets; this way, we are able to detect data drift and model drift in general. Our monitoring jobs also analyze model predictions to see whether the model is biased, i.e. whether it over-favors one class versus another. Lastly, as a bit of meta work, we also monitor our monitoring jobs: for example, a lack of alerts or notifications from our monitoring services for an extended period can itself be an indication that something is going wrong. On top of all this, in many areas manual monitoring is also required.
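
As one example of such a batch bias check, the sketch below computes the gap in positive-prediction rates across groups. The records, group labels, and tolerance are illustrative; what counts as an acceptable gap is ultimately a business and compliance decision:

```python
from collections import defaultdict

# Sketch of a batch bias check: compare the positive-prediction rate across
# groups and flag the model when the gap exceeds a tolerance.
def positive_rate_gap(records):
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, prediction in records:
        counts[group][0] += prediction    # prediction is 0 or 1
        counts[group][1] += 1
    rates = {g: pos / total for g, (pos, total) in counts.items()}
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical (group, binary prediction) pairs from one monitoring window:
records = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 0), ("B", 1)]
gap, rates = positive_rate_gap(records)
biased = gap > 0.25  # hypothetical tolerance
```

A real job would run this per model version over a full prediction log (which is where the versioned prediction records from earlier come in), and raise an alert when `biased` is true.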

On transparency of the ML models

Transparency in ML is mainly about how understandable a model’s predictions are to the people interacting with it. With transparent AI systems, it is possible to see into the decision-making process, which helps those people. One could say that transparency shines a light on the black-box operations of an ML model. Nowadays there is great interest in the explainability aspects of ML, and there are software packages and algorithms that help a lot towards more explainable and fair AI; SHAP, Alibi, and AI Fairness 360 are a few examples.

In general, the level of transparency we want to reach really depends on the nature of the use case. For example, using ML to analyze loan applications will require high explainability of the model outcomes, whereas using ML to classify emails as spam may be less demanding in terms of explainability. Setting the bar for acceptable transparency is always a discussion with the business and compliance people who will own the ML model, since they should be able to explain and understand its predictions.
Transparency in ML operations, and especially in the operation of neural networks, is a very active research topic nowadays. Its importance is obvious, since in the majority of use cases we need to explain how decisions are made in order to trust our models. In the context of MLOps, before deploying models we spend time on the explainability aspects of our models and set our goals, and in most use cases we provide explainability reports together with the model predictions.
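
To illustrate what shipping an explainability report alongside a prediction can look like, here is a toy example for a linear model, where each feature's contribution is simply its weight times its value (the weights and features are hypothetical). For non-linear models, packages such as SHAP produce analogous per-feature attributions:

```python
# Toy linear scoring model with hypothetical weights: the explainability
# report is the per-feature contribution to the score, returned together
# with the prediction itself.
WEIGHTS = {"income": 0.00001, "debt": -0.00002, "age": 0.01}
BIAS = 0.1

def explain(features: dict):
    contributions = {f: WEIGHTS[f] * v for f, v in features.items()}
    score = BIAS + sum(contributions.values())
    return score, contributions

score, report = explain({"income": 50_000, "debt": 10_000, "age": 40})
# `report` now shows which features pushed the score up or down,
# and can be delivered to business owners alongside the prediction.
```

Even this trivial report lets a loan officer see that, say, income pushed the score up while debt pulled it down, which is the kind of insight the business and compliance owners of a model need.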

[Image: Trusted AI & MLOps at the core. How MLOps combines all the different aspects of operationalizing AI.]

Silo AI has worked on scaling AI across several highly regulated organizations in the healthcare, finance, and energy sectors. We have seen how critical it is to monitor ML models in production, and through our work with a myriad of companies we have identified the architectural components, processes, and roles needed for well-governed and auditable ML pipelines.

To be more precise, we not only help organizations set up feature and model monitoring and versioning components, but also help our clients introduce roles and processes around those components. This is crucial, as MLOps is not just a platform but rather a human process assisted by platforms. To give an example of this work, we have established multi-step approval processes for releasing ML models, developed data quality processes to evaluate the features used by ML models, and created various batch operations to evaluate model predictions, among other things. Together, these machine learning operations offer a systematic and thorough way for an organization to scale its AI.

Harry Souris
Lead Solution Architect
Silo AI

Harry Souris is a Senior AI Solutions Architect experienced in Data Science, Machine Learning, and Big Data in various roles. He has worked as a solution architect on data and MLOps platforms for financial and medical institutions, focusing on data quality and governance and on productizing ML use cases by establishing proper processes and technologies.
