System Design For AI - ML Workloads

The document outlines various components and tools essential for machine learning (ML) workflows, including data ingestion, feature engineering, model training, deployment, monitoring, and security. It emphasizes the importance of clean data, model explainability, and continuous improvement through feedback loops and retraining. Additionally, it highlights the significance of trade-offs in AI system design to meet specific product goals.

Uploaded by

Gavvalapally Nithish

Want your dream tech job?

Follow Lakshmi Marikumar & Everyone Who Codes for expert career advice.

Check out my Topmate page: https://topmate.io/lakshmimarikumar

Data Ingestion & ETL Pipelines

What it is
A structured pipeline that collects, transforms, and loads raw data into clean, usable formats for ML training and inference.

Used For:
Supplying pre-processed, reliable data to models.

Tools: Apache Airflow, Kafka, Spark, AWS Glue

🧠 Data is the foundation of ML: make it clean and fast.
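
A minimal sketch of the collect, transform, load steps, using only the Python standard library. The sample CSV, the column names, and the cleaning rules (drop rows with a missing or non-numeric age, upper-case the country) are assumptions made for illustration; a real pipeline would run such steps under an orchestrator like the tools listed above.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a file, queue, or database.
RAW_CSV = """user_id,age,country
1, 34 ,us
2,,DE
3,abc,fr
4,51,Us
"""

def extract(text):
    """Read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with a missing or non-numeric age and normalize the rest."""
    clean = []
    for row in rows:
        age = row["age"].strip()
        if not age.isdigit():
            continue  # unusable row: skip it rather than pass bad data to a model
        clean.append({
            "user_id": int(row["user_id"]),
            "age": int(age),
            "country": row["country"].strip().upper(),
        })
    return clean

def load(rows, sink):
    """Append clean rows to a sink (here a plain list) and report how many were loaded."""
    sink.extend(rows)
    return len(rows)

warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
```

Keeping each stage a separate function makes it possible to test and rerun stages independently.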

Feature Engineering & Feature Stores

What it is
A system for building, storing, and serving consistent ML features for both offline training and online inference environments.

Used For:
Reusability and consistency in feature values across environments.

Tools: Feast, Tecton, Vertex AI

🧠 Good features = good predictions.
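
The consistency idea can be sketched with a toy in-memory store: one feature definition feeds both the offline (training) history and the online (serving) lookup. The class, method, and feature names below are hypothetical and are not the API of Feast, Tecton, or Vertex AI.

```python
from datetime import date

def compute_features(user, as_of):
    """Single feature definition shared by training and serving."""
    age_days = (as_of - user["signup_date"]).days
    return {
        "account_age_days": age_days,
        "orders_per_day": user["orders"] / max(age_days, 1),
    }

class InMemoryFeatureStore:
    """Toy store: a dict for online lookups, a list for offline training rows."""

    def __init__(self):
        self._online = {}    # entity id -> latest feature values
        self._offline = []   # append-only history used to build training sets

    def materialize(self, users, as_of):
        for user in users:
            feats = compute_features(user, as_of)
            self._online[user["id"]] = feats
            self._offline.append({"id": user["id"], "as_of": as_of, **feats})

    def get_online_features(self, entity_id):
        return self._online[entity_id]

    def get_training_rows(self):
        return list(self._offline)

store = InMemoryFeatureStore()
store.materialize(
    [{"id": "u1", "signup_date": date(2024, 1, 1), "orders": 20}],
    as_of=date(2024, 1, 11),
)
online = store.get_online_features("u1")
training_row = store.get_training_rows()[0]
```

Because both paths call `compute_features`, the values a model sees at serving time match the ones it was trained on.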

Model Training Infrastructure

What it is
A scalable and reproducible environment for training models with distributed compute, GPU support, and version control.

Used For:
Parallel model training, experimentation, reproducibility.

Tools: MLflow, Kubeflow, Ray, W&B

🧠 Scale your training like a distributed system.

Data Versioning & Lineage

What it is
Tracks historical versions of data and its flow across training pipelines to ensure reproducibility and auditability.

Used in:
Reproducing experiments, data auditing, rollback.

Tools: DVC, LakeFS, Pachyderm

🧠 Know exactly which data trained which model.
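
One way to know which data trained which model is to derive a version id from the data's content and record it next to the model name. The sketch below assumes small JSON-serializable datasets and uses hypothetical model names; dedicated tools such as those listed above handle large files and remote storage.

```python
import hashlib
import json

def dataset_version(rows):
    """Content hash of the dataset: identical rows always yield the same id."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

lineage = {}  # model name -> version id of the data it was trained on

def record_training_run(model_name, rows):
    """Store the data version used for a model and return it."""
    version = dataset_version(rows)
    lineage[model_name] = version
    return version

data_v1 = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
data_v2 = data_v1 + [{"x": 3, "y": 1}]

v1 = record_training_run("churn-v1", data_v1)
v2 = record_training_run("churn-v2", data_v2)
```

Any change to the rows produces a different id, so a mismatch between a model's recorded version and the current data is easy to detect.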

Model Deployment & Serving

What it is
Infrastructure that exposes trained models via APIs or services for real-time or batch inference in production.

Used in:
Making model predictions available to end-users or systems.

Tools: TensorFlow Serving, TorchServe, BentoML, FastAPI

🧠 A model not deployed is just a math file.

Batch vs Real-Time Inference

What it is
Two inference modes: batch for scheduled processing, and real-time for on-the-fly, low-latency predictions.

Used in:
Offline scoring vs live model response.

Tools: Airflow, Kafka, FastAPI

🧠 Use the right serving mode for the use case.
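
A small sketch of the two modes sharing one model function. The toy scoring formula, the field names, and the 50 ms latency budget are assumptions chosen for illustration.

```python
import time

def predict(features):
    """Toy model shared by both serving modes."""
    return 0.25 * features["clicks"] + 0.5 * features["purchases"]

def batch_score(rows):
    """Scheduled job: score every row at once and keep results for later reads."""
    return {row["id"]: predict(row) for row in rows}

def realtime_score(row, budget_ms=50.0):
    """Request path: score one row and report whether the latency budget held."""
    start = time.perf_counter()
    score = predict(row)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return score, elapsed_ms <= budget_ms

rows = [
    {"id": "a", "clicks": 4, "purchases": 2},
    {"id": "b", "clicks": 0, "purchases": 1},
]
nightly_scores = batch_score(rows)
live_score, within_budget = realtime_score(rows[0])
```

Batch mode optimizes for throughput over many rows, while the real-time path is judged per request against a latency budget.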

Vector Databases

What it is
Databases optimized for similarity search on embeddings using nearest-neighbor indexing and fast vector queries.

Used in:
Semantic search, recommendations, LLM RAG systems.

Tools: Pinecone, Weaviate, Qdrant, FAISS

🧠 Embeddings need fast, approximate lookups.
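
To make similarity search concrete, here is an exact brute-force nearest-neighbor lookup using cosine similarity. This is a deliberately naive sketch with made-up two-dimensional vectors; production vector databases replace the linear scan with approximate indexes to stay fast at scale.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorIndex:
    """Exact search by scanning every stored vector (O(n) per query)."""

    def __init__(self):
        self._items = {}

    def add(self, key, vector):
        self._items[key] = vector

    def search(self, query, k=2):
        scored = [(cosine(query, vec), key) for key, vec in self._items.items()]
        scored.sort(reverse=True)  # highest similarity first
        return [key for _, key in scored[:k]]

index = TinyVectorIndex()
index.add("cat", [1.0, 0.0])
index.add("kitten", [0.9, 0.1])
index.add("car", [0.0, 1.0])

nearest = index.search([1.0, 0.05], k=2)
```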

Streaming for Online Learning

What it is
Live data pipelines that provide continuous input for model predictions, monitoring, or incremental updates.

Used in:
Real-time ML systems and feedback-driven workflows.

Tools: Kafka, Flink, Spark Streaming

🧠 Real-time data = real-time value.
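
Incremental updates can be sketched as a model that learns from one event at a time instead of retraining on a full dataset. The generator below stands in for a live stream, and the one-weight linear model with a fixed learning rate is an assumption for illustration.

```python
def event_stream():
    """Stand-in for a live stream: yields (x, y) pairs where y = 2 * x."""
    for x in [1.0, 2.0] * 50:
        yield x, 2.0 * x

class OnlineLinearModel:
    """One-weight linear model updated by stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x

    def learn_one(self, x, y):
        # Gradient of 0.5 * (w*x - y)^2 with respect to w is (w*x - y) * x.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x

model = OnlineLinearModel()
for x, y in event_stream():
    model.learn_one(x, y)  # the model improves as each event arrives
```

After the stream is consumed, `model.w` is close to the true slope of 2.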

Model Explainability

What it is
Techniques that explain why a model made a certain prediction, improving transparency and stakeholder trust.

Used in:
Regulatory compliance, debugging, user trust.

Tools: SHAP, LIME, Captum

🧠 Black-box models need transparent explanations.
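
For a linear model, a per-feature explanation can be computed directly: each feature's contribution is its weight times its distance from a baseline input. The weights, baseline, and feature names below are invented for illustration. This sketch does not use the SHAP, LIME, or Captum APIs; those tools extend additive attributions of this kind to models that are not linear.

```python
# Hypothetical linear model and baseline (for example, average feature values).
WEIGHTS = {"income": 0.5, "debt": -0.25, "age": 0.125}
BASELINE = {"income": 4.0, "debt": 2.0, "age": 8.0}

def predict(x):
    """Linear score: weighted sum of the features."""
    return sum(WEIGHTS[name] * x[name] for name in WEIGHTS)

def explain(x):
    """Contribution of each feature relative to the baseline input."""
    return {name: WEIGHTS[name] * (x[name] - BASELINE[name]) for name in WEIGHTS}

applicant = {"income": 6.0, "debt": 4.0, "age": 8.0}
contributions = explain(applicant)
```

The contributions sum to `predict(applicant) - predict(BASELINE)`, so every part of the score change is attributed to some feature.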

Monitoring & Drift Detection

What it is
Observes model performance over time and detects unexpected changes in input data or prediction quality.

Used in:
Maintaining accuracy in production.

Tools: WhyLabs, Arize AI, Prometheus

🧠 What works today may fail tomorrow: monitor always.
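
A simple form of performance monitoring is a rolling accuracy window that raises a flag when quality drops. The window size and the 0.8 accuracy floor are assumed values, and the sketch presumes ground-truth labels eventually arrive for each prediction.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window=5, min_accuracy=0.8):
        self._hits = deque(maxlen=window)  # old results fall off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self._hits.append(predicted == actual)

    @property
    def accuracy(self):
        return sum(self._hits) / len(self._hits) if self._hits else None

    def is_degraded(self):
        # Only alert once the window is full, to avoid noise from tiny samples.
        window_full = len(self._hits) == self._hits.maxlen
        return window_full and self.accuracy < self.min_accuracy

monitor = AccuracyMonitor(window=5, min_accuracy=0.8)
for _ in range(5):
    monitor.record(predicted=1, actual=1)
healthy = monitor.is_degraded()

for _ in range(3):
    monitor.record(predicted=1, actual=0)
degraded = monitor.is_degraded()
```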

Feature Drift Monitoring

What it is
Detects shifts in the statistical distribution of input features over time, which could impact model accuracy.

Used in:
Identifying early signs of model degradation.

Tools: Evidently, Alibi Detect, River

🧠 Keep an eye on the data, not just the predictions.
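
One common way to quantify a distribution shift is the Population Stability Index (PSI), which compares binned proportions of a reference sample and a recent sample. The bin edges and sample values below are assumptions, as is the 0.2 alert threshold, which is a frequently quoted rule of thumb rather than a fixed standard. Both samples are assumed to be non-empty.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bin_index = sum(v >= e for e in edges)  # number of edges at or below v
            counts[bin_index] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # feature values at training time
recent = [v + 5 for v in reference]           # same feature, shifted upward
edges = [4, 7]                                # three bins: <4, 4 to 6, >=7

no_drift = psi(reference, reference, edges)
drift = psi(reference, recent, edges)
drift_alert = drift > 0.2
```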

Model Security & Governance

What it is
Frameworks and tools that secure ML assets, track model usage, and enforce organizational policies.

Used in:
Access control, auditability, ethical compliance.

Tools: MLflow, Seldon, Azure Purview

🧠 Models are assets: protect them like code.

CI/CD for ML Pipelines

What it is
Automated workflows that test, validate, and deploy ML models continuously, like traditional DevOps pipelines.

Used in:
Fast, reliable shipping of model updates.

Tools: GitHub Actions, Jenkins, DVC

🧠 Ship ML code as confidently as app code.
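
The validate step can be sketched as a gate that a CI job runs before promoting a candidate model. The metric names, the 0.01 accuracy tolerance, and the 100 ms latency budget are hypothetical; a real pipeline would read these metrics from an evaluation run.

```python
def validation_gate(candidate, baseline,
                    max_accuracy_drop=0.01, latency_budget_ms=100.0):
    """Return a list of failures; an empty list means the candidate may ship."""
    failures = []
    if candidate["accuracy"] < baseline["accuracy"] - max_accuracy_drop:
        failures.append("accuracy regressed beyond tolerance")
    if candidate["p95_latency_ms"] > latency_budget_ms:
        failures.append("latency over budget")
    return failures

baseline = {"accuracy": 0.91, "p95_latency_ms": 60.0}
good_candidate = {"accuracy": 0.93, "p95_latency_ms": 40.0}
bad_candidate = {"accuracy": 0.85, "p95_latency_ms": 150.0}

good_result = validation_gate(good_candidate, baseline)
bad_result = validation_gate(bad_candidate, baseline)
```

In a CI job, a non-empty result would fail the build and block deployment.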

Retraining Triggers

What it is
Conditions that automatically start a model retraining run, such as a fixed schedule, detected drift, or accumulated feedback.

Used in:
Automating model refresh cycles.

Triggers: Time-based, Drift-based, Feedback loops

🧠 Smart retraining = stable performance.
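
The three trigger types can be combined in one check, sketched below. The 30-day maximum model age, the 0.2 drift threshold, and the 1000-label feedback threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta

def retrain_reasons(last_trained, now, drift_score, new_labels,
                    max_age=timedelta(days=30),
                    drift_threshold=0.2,
                    min_new_labels=1000):
    """Return which triggers fired; an empty list means no retraining is needed."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("time")       # time-based: model is older than max_age
    if drift_score >= drift_threshold:
        reasons.append("drift")      # drift-based: inputs have shifted
    if new_labels >= min_new_labels:
        reasons.append("feedback")   # feedback loop: enough new labelled data
    return reasons

last_trained = datetime(2024, 1, 1)

quiet = retrain_reasons(last_trained, datetime(2024, 1, 10),
                        drift_score=0.05, new_labels=10)
stale_and_drifted = retrain_reasons(last_trained, datetime(2024, 2, 15),
                                    drift_score=0.3, new_labels=10)
```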

Multi-Model Management

What it is
The practice of deploying and monitoring multiple models for different users, regions, or experiments.

Used in:
A/B testing, personalization, shadow testing.

Tools: Seldon Core, BentoML, MLflow

🧠 Manage models like a portfolio.

Shadow & Canary Deployment

What it is
Strategies to test models in production before full rollout, either on a limited audience (canary) or silently alongside the live model (shadow).

Used in:
Reducing deployment risks and regressions.

Tools: Istio, Seldon, Argo Rollouts

🧠 Test in production without breaking production.
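
Both strategies can be sketched in a few lines. Canary routing sends a fixed percentage of requests to the candidate model, using a hash of a request id so the same id always gets the same route. A shadow call serves the stable model's answer while logging the candidate's output for comparison. The function names and the request-id scheme are hypothetical; the tools above typically do this at the traffic-routing layer.

```python
import hashlib

def bucket(request_id, buckets=100):
    """Map a request id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def canary_route(request_id, canary_percent):
    """Send roughly `canary_percent` percent of ids to the candidate model."""
    return "candidate" if bucket(request_id) < canary_percent else "stable"

def shadow_call(features, stable_model, candidate_model, log):
    """Serve the stable prediction; run the candidate silently and log it."""
    served = stable_model(features)
    try:
        log.append({"served": served, "shadow": candidate_model(features)})
    except Exception as exc:
        # A failing candidate must never affect the response that is served.
        log.append({"served": served, "shadow_error": repr(exc)})
    return served

comparison_log = []
response = shadow_call(3, lambda x: x * 2, lambda x: x * 3, comparison_log)
route = canary_route("req-42", canary_percent=10)
```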

Feedback Loops & Online Learning

What it is
Systems that feed model outputs and user interactions back into training pipelines to improve accuracy over time.

Used in:
Continuous improvement and personalization.

Tools: Kafka, Redis, Streamlit + training pipelines

🧠 Feed production feedback back into training.

Key Trade-offs in AI System Design

What it is
Design decisions that balance speed, cost, complexity, and accuracy depending on your product goals.

Used in:
Prioritizing what's most important for the use case.

Examples: Latency vs accuracy, serverless vs Kubernetes

🧠 Every design choice has a trade-off: choose wisely.
