MLOps
How to operate an ML system productively?
groups/mlopsvn
About me
➢ Working profile
○ 2016-2017: Machine Learning Engineer Freelancer
○ 2017-2019: Machine Learning Researcher @ University of Aizu
○ 2019-2020: Machine Learning Engineer @ Heligate
○ 2020-2022: Senior Machine Learning Engineer @ One Mount
○ 2022-present:
■ Expert Machine Learning Engineer @ MSB
■ Admin of MLOps VN
➢ Contact: https://www.linkedin.com/in/quan-dang/
Ordinary ML workflow
Manual process
Any problems?
❖ Manual executions
❖ Disconnection between DS & Ops engineers
❖ Infrequent release iterations
❖ No CI/CD
❖ No monitoring
Manual process
A new design with
new components
➢ Source control
➢ CI/CD tool
➢ Feature Store
➢ Data/ML Pipelines
➢ Model Registry
➢ ML Metadata Store
➢ Performance Monitoring
Automation
(https://ml-ops.org/content/mlops-principles)
Source control
● Git workflow
● Code version control
● Data version control
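Data version control tools such as DVC work by content-addressing: Git tracks a small hash file while the data itself lives in a cache keyed by that hash. A minimal stdlib-only sketch of that idea (function and directory names are illustrative, not DVC's API):

```python
import hashlib
from pathlib import Path

def snapshot(data_path: str, cache_dir: str = ".cache") -> str:
    """Store a copy of the file under its content hash and return the hash.

    Mimics the content-addressed storage idea behind tools like DVC:
    the short hash, not the large file, is what gets committed to Git.
    """
    data = Path(data_path).read_bytes()
    digest = hashlib.md5(data).hexdigest()
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    (cache / digest).write_bytes(data)
    return digest
```

Re-running `snapshot` on unchanged data yields the same hash, so a dataset version is reproducible from the hash alone.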
CI/CD tool
CI/CD user workflow
(https://docs.gitlab.com/ee/ci/introduction/)
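For an ML project, the CI/CD pipeline typically adds train and deploy stages after the usual tests. A hypothetical `.gitlab-ci.yml` sketch (stage, job, and script names are all illustrative assumptions):

```yaml
# Hypothetical GitLab CI sketch for an ML repo
stages:
  - test
  - train
  - deploy

unit_tests:
  stage: test
  script: pytest tests/

train_model:
  stage: train
  script: python pipelines/train.py
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

deploy_model:
  stage: deploy
  script: python pipelines/deploy.py
  when: manual          # keep a human in the loop for promotion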
Feature store
● Why?
○ Feature reuse
○ Single source for both training
and serving (consistency)
○ Monitor for drift and quality
issues
https://www.tecton.ai/blog/what-is-a-feature-store
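The consistency argument above can be made concrete with a toy in-memory feature store: both the training pipeline and the online service fetch features through the same lookup, so definitions cannot silently diverge (class and method names are illustrative, not any real feature-store API):

```python
class FeatureStore:
    """Toy in-memory feature store: one lookup path shared by
    training and serving, which is what prevents train/serve skew."""

    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def write(self, entity_id, feature_name, value):
        self._features[(entity_id, feature_name)] = value

    def get_vector(self, entity_id, feature_names):
        # Both offline training and the online service call this,
        # so a feature is computed and named exactly one way.
        return [self._features.get((entity_id, f)) for f in feature_names]
```

A real feature store adds point-in-time correctness, an offline store for backfills, and a low-latency online store, but the single-source-of-truth lookup is the core idea.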
Data pipelines
ML pipelines
ML metadata store
● Run information
○ Description
○ Start/end time, duration
○ Executor
○ Parameters
● Artifacts in each step
○ Input/output data
○ Figures
○ HTML files
● Metrics
○ ML related: MAE, R², etc.
○ Data related: completeness, KS-test, etc.
MLFlow experiment tracking
Kubeflow Pipeline Metadata
(Trevor Grant et al., Kubeflow for Machine Learning)
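One of the data-related metrics mentioned above, the two-sample Kolmogorov-Smirnov test, reduces to a simple statistic: the maximum gap between two empirical CDFs. A minimal pure-Python sketch (in practice you would use `scipy.stats.ks_2samp`, which also returns a p-value):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs. A large value suggests the
    feature's distribution has shifted between the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    max_gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap
```

Logging this statistic per feature per run into the metadata store gives a cheap drift signal between training data and live traffic.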
Model registry
A UI and a set of APIs to manage models:
● Model lineage (which experiment/run produced the model)
● Model versions
● Model state transitions, e.g., from Staging to Production
● Model description/documentation
● Model validation results/metrics
Model lineage and description
Model version and state
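The registry's job can be sketched in a few lines: auto-incremented versions per model name, lineage back to the producing run, and guarded stage transitions. A toy sketch (class name, stage names, and the allowed-transition table are illustrative, loosely modeled on MLflow's registry):

```python
class ModelRegistry:
    """Toy registry: tracks versions per model name and guards
    stage transitions (None -> Staging -> Production)."""

    _allowed = {
        None: {"Staging"},
        "Staging": {"Production", None},
        "Production": {"Staging", None},
    }

    def __init__(self):
        self._models = {}  # name -> list of version records

    def register(self, name, run_id, description=""):
        # Lineage: every version remembers the run that produced it.
        versions = self._models.setdefault(name, [])
        versions.append({"version": len(versions) + 1, "run_id": run_id,
                         "description": description, "stage": None})
        return len(versions)

    def transition(self, name, version, stage):
        entry = self._models[name][version - 1]
        if stage not in self._allowed[entry["stage"]]:
            raise ValueError(f"cannot move {entry['stage']} -> {stage}")
        entry["stage"] = stage
```

Guarding transitions in code is what lets CI/CD promote a model to Production only after its validation metrics pass.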
Performance monitoring (1)
Performance level:
● Data integrity: inspect the volume, variety, veracity, and velocity of incoming & outgoing data to detect outliers and anomalies
● Model drift:
○ Features drift: inputs X (features)
○ Target drift: outputs y (labels)
○ Concept drift: relationship between X and y
● Business metrics and ROI, e.g., Click-Through Rate (CTR) and engagement metrics in a social network company
System health:
● Number of requests (throughput)
● Average response time (latency)
● Number of failed requests
● IO/Memory/CPU usage
● Disk utilization
● System uptime
API outgoing data:
● Model metadata: name, version, docker image
● Input data (features)
● Predictions
● System actions
● Explanation
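The system-health numbers above fall out of a request log directly. A minimal stdlib sketch, assuming each request is a dict with a latency and an HTTP status (field names are illustrative):

```python
from statistics import mean

def summarize(requests, window_seconds):
    """Compute basic system-health metrics from a list of request
    records observed over a fixed time window."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "throughput_rps": len(requests) / window_seconds,
        "avg_latency_ms": mean(latencies) if latencies else 0.0,
        "failed": sum(r["status"] >= 500 for r in requests),
    }
```

In production these numbers are usually exported as Prometheus metrics rather than computed in batch, but the quantities are the same.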
Performance monitoring (2)
Seldon Core API dashboard
(https://github.com/SeldonIO/seldon-core/blob/master/README.md)
Performance monitoring (3)
Payload logging with ELK stack
(https://github.com/SeldonIO/seldon-core/blob/master/README.md)
Additional: Experimentation platform
A custom platform / on premises:
● EDA: JupyterHub, Jupyter Notebook
● IDE: Code server
● Pre-built Docker images with common libraries shared among team members
● Try to build your own libraries, e.g., a custom AutoML library with data in and model out
● Git for code versioning
● Data versioning (e.g., DVC)
● Experiment tracking (e.g., MLFlow)
On cloud:
● AWS SageMaker Studio
● Google Vertex AI
● Azure Machine Learning Studio
Additional: Model serving frameworks (1)
Serving frameworks span a spectrum from general purpose, to model agnostic, to framework specific.
Additional: Model serving frameworks (2)
Frameworks compared: KServe (0.9.0), Seldon Core (1.14.0), BentoML (1.0.0), Triton Serve (2.23.0), Ray Serve (1.13.0)
Features compared:
● GPU support
● Micro (adaptive) batching
● Offline batch serving
● Autoscaling
● Scale to 0
● Canary deployments
● A/B tests/MAB deployments
● Native Kafka integration
● gRPC
● Tracing
● Prometheus metrics
A comparison of model agnostic serving frameworks
Skill set
Roles and their intersections contributing to
the MLOps paradigm
(https://arxiv.org/pdf/2205.02302.pdf)
Study materials
● Books
● Courses
○ https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops
○ https://stanford-cs329s.github.io
○ https://fullstackdeeplearning.com
○ https://github.com/DataTalksClub/mlops-zoomcamp
○ https://github.com/alexeygrigorev/mlbookcamp-code
● Blogs
○ https://madewithml.com
○ https://mlops.community/blog
To sum up