Lecture 3 - 1-ML and Data Systems Fundamentals
10/6/24
Agenda
1. ML systems fundamentals
2. Decoupling objectives
3. Breakout exercise
4. Data engineering 101
1. ML Systems Fundamentals
ML in production: expectation
1. Collect data
2. Train model
3. Deploy model
ML in production: reality
(Steps 15 and 17 are essential.)
1. Choose a metric to optimize
2. Collect data
3. Train model
4. Realize many labels are wrong -> relabel data
5. Train model
6. Model performs poorly on one class -> collect more data for that class
7. Train model
8. Model performs poorly on most recent data -> collect more recent data
9. Train model
10. Deploy model
11. Dream about $$$
12. Wake up at 2am to complaints that the model is biased against one group -> revert to older version
13. Get more data, train more, do more testing
14. Deploy model
15. Pray
16. Model performs well but revenue is decreasing
17. Cry
18. Choose a different metric
19. Start over
Project considerations
1. Framing
2. Objectives
3. Constraints
4. Phases
Task type
Regression Classification
[Figure: example labels 0 0 0 1 0 1 0 1, with one binary model per class]
Model 1: Does this belong to class 1?
Model 2: Does this belong to class 2?
…
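The per-class questions in the figure can be sketched as one binary scorer per class, where an example is assigned to every class whose scorer says yes. This is only a minimal sketch: the scorers, the 0.5 threshold, and the `tags` feature are made up for illustration and are not from the lecture.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained binary model:
# it returns the estimated probability that the example belongs to its class.
BinaryModel = Callable[[dict], float]


def predict_classes(example: dict,
                    models: Dict[int, BinaryModel],
                    threshold: float = 0.5) -> List[int]:
    """Ask each per-class model 'does this belong to my class?' and keep the yeses."""
    return sorted(c for c, model in models.items() if model(example) >= threshold)


# Toy scorers (illustrative only, not real models).
models = {
    1: lambda ex: 0.9 if "tech" in ex["tags"] else 0.1,
    2: lambda ex: 0.8 if "finance" in ex["tags"] else 0.2,
}

article = {"tags": ["tech", "finance"]}
print(predict_classes(article, models))  # [1, 2]
```

Because each model answers independently, an example can end up with zero, one, or several classes.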
Poll: Which classes should this example belong to?
1. 0
2. 0, 1
3. 0, 1, 2
Problem: predict the app users will most likely open next
Classification
[Figure: classification output, one score per app: App 0 -> 0.2, …, 0.04]
Problem: predict the app users will most likely open next
Regression
[Figure: regression model and its OUTPUT]
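The two framings of the next-app problem can be contrasted in a short sketch. It assumes the classification model emits one logit per app and the regression model scores a single (user, app) pair; `toy_score`, the `affinity` field, and the app names are hypothetical placeholders, not part of the lecture.

```python
import math
from typing import Callable, Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_app_classification(logits: List[float]) -> Tuple[int, List[float]]:
    """Classification framing: one output per app, so the output size is tied to the app count."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs


def next_app_regression(score: Callable[[dict, str], float],
                        user: dict,
                        apps: List[str]) -> Tuple[str, Dict[str, float]]:
    """Regression framing: score each (user, app) pair separately, then pick the highest."""
    scores = {app: score(user, app) for app in apps}
    return max(scores, key=scores.get), scores


def toy_score(user: dict, app: str) -> float:
    # Hypothetical scorer standing in for a trained regression model.
    return user["affinity"].get(app, 0.0)


user = {"affinity": {"mail": 0.2, "maps": 0.7}}
print(next_app_classification([2.0, 0.5, 0.1])[0])                  # 0
print(next_app_regression(toy_score, user, ["mail", "maps", "chat"])[0])  # maps
```

In the regression sketch a newly installed app is just one more call to the scorer, whereas the classification sketch would need a wider output layer.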
Project objectives
● ML objectives
● Business objectives
Project objectives
● ML objectives
○ Performance
○ Latency
○ etc.
Question: how to evaluate accuracy/F1/etc. without ground truth labels?
Project objectives
● ML objectives
○ Performance
○ Latency
○ etc.
● Business objectives
○ Cost
○ ROI
○ Regulation & compliance
Business objectives
[Figure: customers’ problems solved faster; customers happier]
● Baselines
○ Existing solutions, simple solutions, human experts, competitors’ solutions, etc.
● Usefulness threshold
○ Self-driving needs human-level performance. Predictive texting doesn’t.
● False negatives vs. false positives
○ COVID screening: no false negatives (patients with COVID shouldn’t be classified as not having COVID)
○ Fingerprint unlocking: no false positives (unauthorized people shouldn’t be given access)
● Interpretability
○ Does the ML system need to be interpretable? If yes, to whom?
● Confidence measurement (how confident the system is about a prediction)
○ Does it need confidence measurement?
○ Is there a confidence threshold? What should happen to predictions below that threshold: discard them, loop in humans, or ask users for more information?
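The false negative vs. false positive trade-off can be made concrete by counting errors at different decision thresholds. This is a rough sketch with made-up scores and labels: a low threshold suits a screening-style system that cannot miss positives, a high one suits an unlocking-style system that cannot admit negatives.

```python
from typing import Dict, List


def confusion_counts(scores: List[float], labels: List[int], threshold: float) -> Dict[str, int]:
    """Count TP/FP/FN/TN when every score >= threshold is predicted positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            counts["tp"] += 1
        elif predicted_positive and label == 0:
            counts["fp"] += 1
        elif label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Illustrative data only: model scores and the true labels (1 = positive).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

print(confusion_counts(scores, labels, threshold=0.3))  # {'tp': 3, 'fp': 2, 'fn': 0, 'tn': 1}
print(confusion_counts(scores, labels, threshold=0.9))  # {'tp': 1, 'fp': 0, 'fn': 2, 'tn': 3}
```

On this toy data the low threshold removes false negatives at the cost of two false positives, and the high threshold does the reverse.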
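One way to act on a confidence threshold is a small routing function. The sketch below is an assumption about how such a policy might look: the 0.8 threshold, the `Fallback` names, and the return shape are all hypothetical, chosen only to mirror the three options listed on the slide.

```python
from enum import Enum
from typing import Optional, Tuple


class Fallback(Enum):
    """What to do with a prediction whose confidence is below the threshold."""
    DISCARD = "discard"
    HUMAN_REVIEW = "human_review"
    ASK_USER = "ask_user"


def handle_prediction(prediction: str,
                      confidence: float,
                      threshold: float = 0.8,
                      fallback: Fallback = Fallback.HUMAN_REVIEW) -> Tuple[str, Optional[str]]:
    """Serve confident predictions; send the rest down the configured fallback path."""
    if confidence >= threshold:
        return ("serve", prediction)
    # Below the threshold the prediction itself is withheld.
    return (fallback.value, None)


print(handle_prediction("spam", 0.93))                              # ('serve', 'spam')
print(handle_prediction("spam", 0.50))                              # ('human_review', None)
print(handle_prediction("spam", 0.50, fallback=Fallback.ASK_USER))  # ('ask_user', None)
```

Keeping the fallback as a parameter lets the same model be deployed with different policies per product surface.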
● Time
○ Rule of thumb: spend 20% of the time getting an initial working system and 80% on iterative development
● Budget
○ Data, resources, talent
Time/budget tradeoffs
Constraints: privacy
● Annotation
○ Can data be shipped outside organizations for annotation?
● Storage
○ What kind of data are you allowed to store? How long can you store it?
● Third-party solutions
○ Can you share your data with a third party (e.g., a managed service)?
● Regulations
○ What regulations do you have to conform to?
Technical constraints
● Competitors
● Legacy systems
Phase 1: Before ML
“If you think that machine learning will give you a 100% boost, then a
heuristic will get you 50% of the way there.”
https://newsfeed.org/what-mark-zuckerbergs-news-feed-looked-like-in-2006/
Start with a simple model that allows visibility into its workings, in order to:
● validate hypothesis
● validate pipeline
2. Decoupling objectives
Decoupling objectives
Facebook Employee Raises Powered by ‘Really Dangerous’ Algorithm That Favors Angry Posts (SFist, 2019)
The Making of a YouTube Radical (NYT, 2019)
Wholesome newsfeed
Goal: maximize users’ engagement while minimizing the spread of extreme views and misinformation
Step-by-step objectives:
A Neural Algorithm of Artistic Style (Gatys et al., 2015)
M_q: optimizes quality_loss
M_e: optimizes engagement_loss
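One possible way to use two separately trained models is to combine their scores at ranking time. The sketch below assumes that M_q and M_e each return a score where higher is better, and that the scores are mixed with weights `alpha` and `beta`; the weights, the post fields, and the toy models are hypothetical and only illustrate the idea.

```python
from typing import Callable, List

Scorer = Callable[[dict], float]


def rank_posts(posts: List[dict],
               quality_model: Scorer,
               engagement_model: Scorer,
               alpha: float = 0.5,
               beta: float = 0.5) -> List[dict]:
    """Rank posts by alpha * M_q(post) + beta * M_e(post), highest first."""
    def combined(post: dict) -> float:
        return alpha * quality_model(post) + beta * engagement_model(post)
    return sorted(posts, key=combined, reverse=True)


# Toy stand-ins for the two trained models (illustrative only).
m_q: Scorer = lambda post: post["quality"]
m_e: Scorer = lambda post: post["engagement"]

posts = [
    {"id": "calm_explainer", "quality": 0.9, "engagement": 0.4},
    {"id": "angry_rant", "quality": 0.2, "engagement": 0.9},
]

print([p["id"] for p in rank_posts(posts, m_q, m_e, alpha=0.5, beta=0.5)])
# ['calm_explainer', 'angry_rant']
print([p["id"] for p in rank_posts(posts, m_q, m_e, alpha=0.1, beta=0.9)])
# ['angry_rant', 'calm_explainer']
```

With the objectives decoupled like this, the balance between quality and engagement can be changed by adjusting the weights, without retraining either model.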
3. Breakout exercise
4. Data Engineering
Data engineering
● Data sources
● Data formats
● Data models
● Data storage engines & processing
Data sources
● User generated
● Systems generated
● Internal databases: users, inventory, customer relationships
● Third-party data
● Types of data
○ social media, income, job
● Demographic group
○ men, age 25-34, work in tech
● More available with Mobile Advertiser ID
● Useful for learning features
○ people who like A also like B
onaudience.com/audience-data
Or is this?
TikTok wants to keep tracking iPhone users with state-backed workaround | Ars Technica
Data formats
Column-major:
● stored and retrieved column-by-column
● good for accessing features
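To see why a column-by-column layout suits feature access, the same toy table can be held both as a list of records and as a dictionary of columns. This is an in-memory illustration only, with made-up field names; real column-major formats apply the same idea on disk.

```python
# The same three records in two layouts (data is made up for illustration).

# Record-by-record layout: each element is one complete example.
rows = [
    {"user_id": 1, "age": 34, "clicks": 12},
    {"user_id": 2, "age": 27, "clicks": 3},
    {"user_id": 3, "age": 41, "clicks": 8},
]

# Column-by-column layout: each key holds every value of one feature.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one feature is a single lookup in the column layout...
ages = columns["age"]
# ...but requires visiting every record in the row layout.
ages_from_rows = [row["age"] for row in rows]

mean_age = sum(ages) / len(ages)
print(ages, mean_age)  # [34, 27, 41] 34.0

# Reading one whole example is the cheap operation in the row layout.
print(rows[1])  # {'user_id': 2, 'age': 27, 'clicks': 3}
```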
https://github.com/chiphuyen/just-pandas-things
Store the number 1000000? As text: 7 characters -> 7 bytes. As int32: only 4 bytes.
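The byte counts can be checked directly with Python's standard `struct` module. The sketch assumes ASCII text and a little-endian 32-bit signed integer (`"<i"`); other encodings would give different sizes.

```python
import struct

n = 1000000

# Text representation: one byte per digit in ASCII.
as_text = str(n).encode("ascii")

# Binary representation: a 4-byte little-endian signed integer (int32).
as_int32 = struct.pack("<i", n)

print(len(as_text), len(as_int32))  # 7 4

# The binary form round-trips back to the same number.
decoded = struct.unpack("<i", as_int32)[0]
print(decoded)  # 1000000
```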
Data models
Heading: Column 1, Column 2, Column 3, ...
● Tuple (row): unordered
● Column: unordered
[Table row: 2 | Guava Press | US]
SQL
LEARN SQL!
SQL to NoSQL
https://blog.couchbase.com/nosql-adoption-survey-surprises/
NoSQL
● Document model
○ Central concept: document
○ Relationships between documents are rare
● Graph model
○ Central concept: graph (nodes & edges)
○ Relationships are the priority
Graph model
[Figure: graph of locations. Nodes: (type: country, name: USA), (type: state, name: California), (type: country, name: France). Edges are labeled "within".]
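A graph like the one in the figure can be sketched as a node table plus a list of labeled edges. The three nodes with `type` and `name` come from the slide; the `los_angeles` node and the specific edges are assumptions added so that there is something to traverse.

```python
from typing import Dict, List, Tuple

# Nodes keyed by a hypothetical id; properties follow the slide's type/name fields.
nodes: Dict[str, dict] = {
    "usa": {"type": "country", "name": "USA"},
    "france": {"type": "country", "name": "France"},
    "california": {"type": "state", "name": "California"},
    "los_angeles": {"type": "city", "name": "Los Angeles"},  # illustrative extra node
}

# Edges as (source, label, destination); which nodes they connect is assumed here.
edges: List[Tuple[str, str, str]] = [
    ("california", "within", "usa"),
    ("los_angeles", "within", "california"),
]


def located_in(node_id: str) -> List[str]:
    """Follow 'within' edges upward and return the names of all enclosing locations."""
    parent = {src: dst for src, label, dst in edges if label == "within"}
    chain = []
    while node_id in parent:
        node_id = parent[node_id]
        chain.append(nodes[node_id]["name"])
    return chain


print(located_in("los_angeles"))  # ['California', 'USA']
print(located_in("france"))       # []
```

The query is a walk along relationships rather than a lookup of one record, which is the kind of access the graph model prioritizes.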
Structured vs. unstructured
Transactional processing vs. analytical processing
Column
SELECT AVG(Price)
FROM RideTable
WHERE City = 'Stanford' AND Month = 'July';
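The query above can be tried end to end with Python's built-in `sqlite3` module. The table schema and the four rides are made up for illustration; only the table name, column names, and the query itself come from the slide.

```python
import sqlite3

# In-memory database with a toy RideTable (schema and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RideTable (City TEXT, Month TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO RideTable VALUES (?, ?, ?)",
    [
        ("Stanford", "July", 10.0),
        ("Stanford", "July", 14.0),
        ("Stanford", "June", 30.0),
        ("Berkeley", "July", 50.0),
    ],
)

# The analytical query: it only needs the City, Month and Price columns.
(avg_price,) = conn.execute(
    "SELECT AVG(Price) FROM RideTable WHERE City = 'Stanford' AND Month = 'July'"
).fetchone()

print(avg_price)  # 12.0
```

Only the two Stanford rides in July contribute to the average.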
● OLTP & OLAP: how data is stored is also how it’s processed
○ The same data is stored in multiple databases
○ Each uses a different processing engine for different query types
● New paradigm: storage is decoupled from processing
○ Data can be stored in the same place
○ A processing layer on top that can be optimized for different query types
https://hevodata.com/blog/snowflake-architecture-cloud-data-warehouse/
OLTP -> Extract, Transform, Load (ETL) -> OLAP
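The extract, transform, load flow can be sketched with two in-memory SQLite databases, one standing in for the transactional store and one for the analytical store. Table names, columns, and the cleaning rules are all hypothetical; this is a toy illustration, not a production pipeline.

```python
import sqlite3
from typing import List, Tuple

# Transactional side: one row per ride, written as rides happen (toy data).
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE rides (ride_id INTEGER PRIMARY KEY, city TEXT, price_cents INTEGER)")
oltp.executemany(
    "INSERT INTO rides (city, price_cents) VALUES (?, ?)",
    [(" stanford", 1000), ("Stanford ", 1400), ("berkeley", 5000)],
)

# Analytical side: one pre-aggregated row per city.
olap = sqlite3.connect(":memory:")
olap.execute("CREATE TABLE city_revenue (city TEXT PRIMARY KEY, rides INTEGER, revenue_usd REAL)")


def extract(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Extract: pull the raw rows out of the transactional store."""
    return conn.execute("SELECT city, price_cents FROM rides").fetchall()


def transform(rows: List[Tuple[str, int]]) -> List[Tuple[str, int, float]]:
    """Transform: clean city names, convert cents to dollars, aggregate per city."""
    totals = {}
    for city, cents in rows:
        key = city.strip().title()
        count, revenue = totals.get(key, (0, 0.0))
        totals[key] = (count + 1, revenue + cents / 100)
    return [(city, count, revenue) for city, (count, revenue) in sorted(totals.items())]


def load(conn: sqlite3.Connection, rows: List[Tuple[str, int, float]]) -> None:
    """Load: write the cleaned, aggregated rows into the analytical store."""
    conn.executemany("INSERT INTO city_revenue VALUES (?, ?, ?)", rows)
    conn.commit()


load(olap, transform(extract(oltp)))
result = olap.execute("SELECT * FROM city_revenue ORDER BY city").fetchall()
print(result)  # [('Berkeley', 1, 50.0), ('Stanford', 2, 24.0)]
```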
https://commons.wikimedia.org/wiki/File:Extract,_Transform,_Load_Data_Flow_Diagram.svg
THANK YOU!