What is MLOps?
MLOps, short for Machine Learning Operations, is essentially DevOps for machine learning models. It establishes a set of practices for efficiently developing, testing, and deploying machine learning models into production environments. Traditionally, data scientists have developed models in isolation, which makes those models hard to replicate and audit when they move to real-world applications. MLOps bridges this gap by fostering collaboration between data scientists, engineers, and IT operations teams. This collaboration ensures models are built with a production-ready mindset, consistently tested and deployed, and continuously monitored for improvement.
What is Azure Machine Learning?
Azure Machine Learning is a cloud service that accelerates and manages the machine learning project lifecycle. Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows to train and deploy models and manage MLOps.
You can create a model in Azure Machine Learning or use a model built from an open-source platform, such as PyTorch, TensorFlow, or scikit-learn. MLOps tools help you monitor, retrain, and redeploy models.
Azure Machine Learning is for individuals and teams implementing MLOps within their organization to bring machine learning models into production in a secure and auditable production environment.
Data scientists and ML engineers will find tools to accelerate and automate their day-to-day workflows. Application developers will find tools for integrating models into applications or services. Platform developers will find a robust set of tools, backed by durable Azure Resource Manager APIs, for building advanced ML tooling.
Enterprises working in the Microsoft Azure cloud will find familiar security and role-based access control (RBAC) for infrastructure. You can set up a project to deny access to protected data and select operations.
Key Features of Azure Machine Learning
Git Integration
Azure Machine Learning works with Git version control systems, allowing you to track changes to your machine learning code. Keeping a history of code versions and enabling rollbacks when necessary supports both collaboration and reproducibility.
MLflow Integration
By integrating with MLflow, an open-source platform for managing the machine learning lifecycle, Azure Machine Learning provides tools for experiment tracking and model registry. This facilitates experiment comparison and version control of trained models and simplifies model deployment.
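The comparison workflow that experiment tracking enables can be sketched with a toy tracker. This is not the MLflow API, just a minimal stand-in showing the pattern: each run logs its parameters and metrics so runs can be compared and the best one promoted.

```python
import uuid

class ToyTracker:
    """Toy stand-in for an experiment tracker such as MLflow: each run
    records its parameters and metrics so runs can be compared later."""
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        self.runs.append({"run_id": uuid.uuid4().hex,
                          "params": params, "metrics": metrics})

    def best_run(self, metric):
        # Pick the run with the highest value for the given metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ToyTracker()
tracker.log_run({"lr": 0.1}, {"accuracy": 0.82})
tracker.log_run({"lr": 0.01}, {"accuracy": 0.87})
best = tracker.best_run("accuracy")
```

In real use, MLflow's tracking server and model registry play the role of `ToyTracker`, persisting runs across machines and teams rather than in a single process.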
Machine Learning Pipeline Scheduling
This feature allows you to schedule the execution of entire machine-learning pipelines. Pipelines automate repetitive tasks involved in model training and deployment, such as data pre-processing, training, and evaluation. Scheduling streamlines these processes and reduces manual intervention.
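The shape of such a pipeline can be sketched in plain Python: ordered steps, each consuming the previous step's output. The step bodies here are hypothetical toys; in Azure Machine Learning each step would be a pipeline component, and the whole chain would run on a schedule instead of being called directly.

```python
def preprocess(raw):
    # Toy pre-processing: scale values into [0, 1].
    lo, hi = min(raw), max(raw)
    return [(x - lo) / (hi - lo) for x in raw]

def train(data):
    # Toy "model": just remember the mean of the training data.
    return {"mean": sum(data) / len(data)}

def evaluate(model, data):
    # Toy evaluation: mean absolute deviation from the model's mean.
    return sum(abs(x - model["mean"]) for x in data) / len(data)

def run_pipeline(raw):
    """Run the steps in order, passing each step's output to the next."""
    data = preprocess(raw)
    model = train(data)
    score = evaluate(model, data)
    return model, score

model, score = run_pipeline([2.0, 4.0, 6.0, 8.0])
```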
Azure Event Grid Integration
Azure Event Grid is a service for defining event-driven workflows. Azure Machine Learning integrates with it, allowing you to trigger machine learning workflows based on specific events. This enables real-time model retraining or deployment based on data updates or other triggers.
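The underlying publish/subscribe pattern can be illustrated locally with a small dispatcher. The event name and payload fields below are hypothetical; Event Grid itself delivers events to subscribed endpoints over the network rather than in-process.

```python
handlers = {}

def on(event_type):
    """Register a handler for a given event type (a local stand-in
    for subscribing an endpoint to an Event Grid topic)."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@on("data.updated")
def retrain(event):
    # In a real workflow this would kick off a training pipeline.
    return f"retraining triggered by {event['path']}"

def publish(event_type, event):
    # Deliver the event to every handler subscribed to its type.
    return [fn(event) for fn in handlers.get(event_type, [])]

results = publish("data.updated", {"path": "datasets/sales.csv"})
```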
Compatibility with CI/CD Tools
Azure Machine Learning seamlessly integrates with popular CI/CD (Continuous Integration and Continuous Delivery) tools like GitHub Actions or Azure DevOps. This allows you to incorporate machine learning model deployment into your existing DevOps pipelines, streamline the deployment process, and ensure consistency between development, testing, and production environments.
Monitoring and Auditing
Azure Machine Learning offers comprehensive tools for monitoring model performance in production. You can track metrics like accuracy, precision, and recall to identify potential issues and opportunities for improvement. Additionally, auditing features provide a record of model deployments and changes, ensuring transparency and compliance.
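The metrics named above have standard definitions that a monitoring job would compute from production labels and predictions. A minimal sketch for binary classification, using hypothetical label arrays:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of flagged, how many were right
        "recall": tp / (tp + fn) if tp + fn else 0.0,      # of positives, how many were caught
    }

m = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Tracking these over time is what surfaces drift: a model whose recall slides week over week is a candidate for retraining.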
Artifact Management
This feature stores artifacts, such as code snapshots, logs, and other outputs, generated during your machine learning runs. This facilitates troubleshooting by allowing you to revisit past runs and identify the cause of errors. Additionally, it supports model reproducibility by capturing the exact environment and configuration used for training.
Lineage Tracking
Lineage tracking keeps track of the origin and dependencies between various assets used in your machine learning workflows. This includes data, containers (Docker images), and compute resources. Lineage tracking provides transparency and auditability by allowing you to understand the provenance of your models and identify potential bottlenecks or biases in the data pipeline.
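Conceptually, lineage is a dependency graph over assets, and provenance is an upstream walk of that graph. A sketch with hypothetical asset names:

```python
# Each asset maps to the assets it was derived from (names are hypothetical).
lineage = {
    "model:v3": ["dataset:sales-v2", "env:train-image:1.4"],
    "dataset:sales-v2": ["dataset:sales-raw"],
    "env:train-image:1.4": [],
    "dataset:sales-raw": [],
}

def provenance(asset, graph):
    """Walk upstream dependencies to find everything an asset was built from."""
    seen = []
    stack = list(graph.get(asset, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.append(a)
            stack.extend(graph.get(a, []))
    return sorted(seen)

deps = provenance("model:v3", lineage)
```

An auditor asking "what data produced model v3?" gets an answer by traversal rather than by archaeology, which is the point of recording lineage in the first place.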
Collaboration for machine learning teams
Machine learning projects often require a team with varied skill sets to build and maintain. Azure Machine Learning has tools that help enable collaboration, such as:
Shared notebooks, compute resources, data, and environments
Tracking and auditability that shows who made changes and when
Asset versioning
Tools for developers
Developers find familiar interfaces in Azure Machine Learning, such as:
Studio UI
The Azure Machine Learning Studio is a graphical user interface for a project workspace. In the studio, you can:
View runs, metrics, logs, outputs, and so on.
Author and edit notebooks and files.
Manage common assets, such as:
Data credentials
Compute
Environments
Visualize run metrics, results, and reports.
Visualize pipelines authored through developer interfaces.
Author AutoML jobs.
Plus, the designer has a drag-and-drop interface where you can train and deploy models.
Enterprise-readiness and security
Azure Machine Learning integrates with the Azure cloud platform to add security to ML projects.
Security integrations include:
Azure Virtual Networks (VNets) with network security groups
Azure Key Vault, where you can save security secrets, such as access information for storage accounts
Azure Container Registry set up behind a VNet
Azure integrations for complete solutions
Other integrations with Azure services support a machine learning project from end to end. They include:
Azure Synapse Analytics to process and stream data with Spark
Azure Arc, where you can run Azure services in a Kubernetes environment
Storage and database options, such as Azure SQL Database, Azure Storage Blobs, and so on
Azure App Service to deploy and manage ML-powered apps
Machine learning project workflow
Typically, models are developed as part of a project with an objective and goals. Projects often involve more than one person, and development is iterative as the team experiments with data, algorithms, and models.
Project lifecycle
While the project lifecycle can vary by project, it will often look like this:
A workspace organizes a project and enables collaboration among many users working toward a common objective. Users in a workspace can easily share the results of their runs from experimentation in the studio user interface, or use versioned assets for jobs like environments and storage references.
When a project is ready for operationalization, users' work can be automated in a machine-learning pipeline and triggered on a schedule or HTTPS request.
Models can be deployed to the managed inferencing solution for both real-time and batch deployments, abstracting away the infrastructure management typically required for deploying models.
Train models
In Azure Machine Learning, you can run your training script in the cloud or build a model from scratch. Customers often bring models they've built and trained in open-source frameworks so they can operationalize them in the cloud.
Open and interoperable
Data scientists can use models in Azure Machine Learning that they've created in common Python frameworks, such as:
PyTorch
TensorFlow
scikit-learn
XGBoost
LightGBM
Other languages and frameworks are supported as well, including:
R
.NET
Automated featurization and algorithm selection (AutoML)
In classical machine learning, selecting the right data featurization and algorithm for training is a repetitive, time-consuming process that relies on a data scientist's prior experience and intuition. Automated ML (AutoML) speeds up this process and can be used through the studio UI or the Python SDK.
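At its core, AutoML is a search over featurization and algorithm combinations scored on validation data. A toy sketch of that search, with hypothetical candidates and a fixed scoring table standing in for actual training runs:

```python
import itertools

# Hypothetical candidate featurizations and algorithms.
featurizations = ["raw", "scaled", "one-hot"]
algorithms = ["linear", "tree", "boosted"]

def toy_validation_score(featurization, algorithm):
    # Stand-in for "train the model, measure validation accuracy".
    # Scores are fixed here purely for illustration.
    table = {("scaled", "boosted"): 0.91, ("raw", "tree"): 0.84}
    return table.get((featurization, algorithm), 0.75)

# Try every combination and keep the best-scoring one.
best = max(itertools.product(featurizations, algorithms),
           key=lambda combo: toy_validation_score(*combo))
```

A real AutoML run prunes this space with smarter search strategies and trains actual models at each step, but the select-by-validation-score loop is the same idea.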
Hyperparameter optimization
Hyperparameter optimization, or hyperparameter tuning, can be a tedious task. Azure Machine Learning can automate this task for arbitrary parameterized commands with little modification to your job definition. Results are visualized in the studio.
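The automated version of this task is essentially sampling configurations and keeping the one with the best objective value. A minimal random-search sketch, where the objective function is a hypothetical stand-in for a training job's validation loss:

```python
import math
import random

def sample_config(rng):
    """Sample one hyperparameter configuration (log-uniform learning rate)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }

def toy_objective(cfg):
    # Stand-in for validation loss; here it simply prefers lr near 1e-2.
    return abs(math.log10(cfg["learning_rate"]) + 2)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
trials = [sample_config(rng) for _ in range(50)]
best = min(trials, key=toy_objective)
```

Azure Machine Learning's sweep jobs run each trial as a separate command on cluster compute and add early-termination policies, but the sample-evaluate-select loop above is the underlying idea.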
Multinode distributed training
Training efficiency for deep learning jobs, and sometimes classical machine learning jobs, can be drastically improved via multinode distributed training. Azure Machine Learning compute clusters offer the latest GPU options.
Frameworks supported via Azure ML Kubernetes and Azure ML compute clusters include:
PyTorch
TensorFlow
MPI
The MPI distribution can be used for Horovod or custom multinode logic. Additionally, Apache Spark is supported via Azure Synapse Analytics Spark clusters (preview).
Embarrassingly parallel training
Scaling a machine learning project may require scaling embarrassingly parallel model training. This pattern is common for scenarios like forecasting demand, where a model may be trained for many stores.
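Because each per-store training job is independent, the whole set can be fanned out at once. A sketch with hypothetical store data and a toy mean-forecast "model"; threads are used here so the example is self-contained, whereas on Azure ML this fan-out would run across cluster nodes:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-store sales history; each store gets its own model.
store_sales = {
    "store_a": [10.0, 12.0, 14.0],
    "store_b": [3.0, 5.0],
    "store_c": [20.0, 20.0, 26.0],
}

def train_store_model(item):
    store, sales = item
    # Toy "model": forecast future demand as the historical mean.
    return store, {"forecast": sum(sales) / len(sales)}

# No job depends on any other, so all of them can run in parallel —
# the defining property of an embarrassingly parallel workload.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(train_store_model, store_sales.items()))
```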
Deploy models
To bring a model into production, it is deployed. Azure Machine Learning's managed endpoints abstract the required infrastructure for both batch and real-time (online) model scoring (inferencing).
Real-time and batch scoring (inferencing)
Batch scoring, or batch inferencing, involves invoking an endpoint with a reference to data. The batch endpoint runs jobs asynchronously to process data in parallel on compute clusters and store the data for further analysis.
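The batch pattern amounts to splitting a dataset into chunks, scoring each chunk, and collecting results for later analysis. A sequential sketch with a hypothetical threshold model (a batch endpoint would run the chunks asynchronously across a compute cluster):

```python
def score(record):
    # Toy model: flag records whose value exceeds a threshold.
    return {"id": record["id"], "flagged": record["value"] > 100}

def batch_score(records, chunk_size=2):
    """Score data chunk by chunk, collecting results for later analysis."""
    results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i : i + chunk_size]
        results.extend(score(r) for r in chunk)
    return results

data = [{"id": n, "value": v} for n, v in enumerate([42, 150, 99, 300])]
results = batch_score(data)
```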
Real-time scoring, or online inferencing, involves invoking an endpoint with one or more model deployments and receiving a response in near-real-time via HTTP. Traffic can be split across multiple deployments, so you can test a new model version by initially diverting a small share of traffic to it and increasing that share as confidence in the new model grows.
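The traffic-splitting mechanism can be sketched as weighted random routing. The deployment names and 90/10 split below are hypothetical; on a managed online endpoint you would set these weights on the endpoint itself rather than in application code:

```python
import random

def route(deployments, rng):
    """Pick a deployment according to its traffic share (weights sum to 100)."""
    names = list(deployments)
    weights = [deployments[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical split: 90% of traffic to the current model, 10% to the candidate.
traffic = {"blue": 90, "green": 10}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [route(traffic, rng) for _ in range(1000)]
green_share = sample.count("green") / len(sample)
```

Once the candidate's monitored metrics hold up on its slice of traffic, its weight is increased until it takes over; if they degrade, the weight goes back to zero with no redeployment.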
MLOps: DevOps for machine learning
DevOps for machine learning models, often called MLOps, is a process for developing models for production. A model's lifecycle from training to deployment must be auditable if not reproducible.
ML model lifecycle
Integrations enabling MLOps
Azure Machine Learning is built with the model lifecycle in mind. You can audit the model lifecycle down to a specific commit and environment.
Some key features enabling MLOps include:
Git integration
MLflow integration
Machine learning pipeline scheduling
Azure Event Grid integration for custom triggers
Integration with CI/CD tools like GitHub Actions or Azure DevOps
Also, Azure Machine Learning includes features for monitoring and auditing:
Job artifacts, such as code snapshots, logs, and other outputs
Lineage between jobs and assets, such as containers, data, and compute resources