Lecture 3: ML and Data Systems Fundamentals

Agenda

1. ML systems fundamentals
2. Decoupling objectives
3. Breakout exercise
4. Data engineering 101

1. ML Systems Fundamentals


ML in production: expectation

1. Collect data
2. Train model
3. Deploy model

ML in production: reality

(Steps 15 and 17 are essential.)

1. Choose a metric to optimize
2. Collect data
3. Train model
4. Realize many labels are wrong -> relabel data
5. Train model
6. Model performs poorly on one class -> collect more data for that class
7. Train model
8. Model performs poorly on most recent data -> collect more recent data
9. Train model
10. Deploy model
11. Dream about $$$
12. Wake up at 2am to complaints that model biases against one group -> revert to older version
13. Get more data, train more, do more testing
14. Deploy model
15. Pray
16. Model performs well but revenue decreasing
17. Cry
18. Choose a different metric
19. Start over


Project considerations

1. Framing
2. Objectives
3. Constraints
4. Phases

Framing the problem

Task type (diagram): a task is either Regression or Classification. Classification splits into Binary, Multiclass, and Multilabel; Multiclass further splits into low cardinality and high cardinality.


Multiclass vs. multilabel

Multiclass (e.g. label vector 0 0 0 1): an example can belong to only one class.
Multilabel (e.g. label vector 0 1 0 1): an example can belong to multiple classes.

How to handle multilabel tasks

Two ways to solve a multilabel problem:
● As a multiclass problem (e.g. with the label vector 0 1 0 1)
● As a set of multiple binary problems (Model 1: does this belong to class 1? Model 2: does this belong to class 2? ...)


Multilabel is harder than multiclass

Whichever solution you choose (one multiclass problem or a set of multiple binary problems), two questions remain:
1. How to create ground truth labels?
2. How to decide decision boundaries?

Multilabel: decision boundaries

Suppose the model outputs the probabilities 0.45, 0.33, 0.2, 0.02 for classes 0, 1, 2, 3.

Poll: which classes should this example belong to?
1. 0
2. 0, 1
3. 0, 1, 2
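A minimal sketch (my own, not from the slides; NumPy and the threshold value are assumptions) of how the choice of decision boundary answers the poll:

import numpy as np

# The per-class probabilities from the poll, for classes 0-3.
probs = np.array([0.45, 0.33, 0.20, 0.02])

# Multiclass framing: pick exactly one class.
multiclass_pred = int(np.argmax(probs))            # -> 0

# Multilabel framing: keep every class whose probability clears a chosen threshold.
threshold = 0.30                                   # the decision boundary you have to pick
multilabel_pred = np.where(probs >= threshold)[0]  # -> [0, 1]

print(multiclass_pred, multilabel_pred.tolist())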

A problem can be framed as different task types

Problem: predict the app users will most likely open next

Classification framing (diagram): the input is one feature vector made of the user’s features plus environment features (time, location, etc.); the output is one probability per app (e.g. App 0: 0.2, App 1: 0.02, ...).

⚠ With the classification framing, every time an app is added or removed, you have to retrain your model. ⚠


Framing can make the problem easier/harder

Problem: predict the app users will most likely open next

Regression framing (diagram): there is one input per (user, app) pair, made of the user’s features, environment features (time, location, etc.), and the app’s features; the output is a single score per pair (e.g. App 0: 0.03, App 1: 0.06, ...).

This regression framing is very common for recommendations / ads CTR prediction: adding or removing an app only changes the inputs, so the model doesn’t need to be retrained just because the catalog of apps changed.
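A minimal sketch (my own illustration; scikit-learn, the feature sizes, and the random data are all assumptions, not part of the slides) contrasting the two framings:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n_apps, n_user_feats, n_app_feats = 5, 8, 3

# Classification framing: one input per user, one output per app.
X_users = rng.normal(size=(100, n_user_feats))            # user + environment features
y_app = rng.integers(0, n_apps, size=100)                 # which app was opened next
clf = LogisticRegression(max_iter=1000).fit(X_users, y_app)
# Adding a 6th app changes the output space -> retrain from scratch.

# Regression framing: one input per (user, app) pair, one scalar score out.
X_pairs = rng.normal(size=(100, n_user_feats + n_app_feats))  # user + environment + app features
y_score = rng.random(size=100)                                # e.g. probability of opening
reg = LinearRegression().fit(X_pairs, y_score)
# A new app is just new rows with its own app features; the model's shape is unchanged.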


Project objectives

● ML objectives
● Business objectives


Project objectives

● ML objectives
○ Performance (how to evaluate accuracy/F1/etc. without ground truth labels?)
○ Latency
○ etc.


Project objectives

● ML objectives
○ Performance
○ Latency
○ etc.
● Business objectives
○ Cost
○ ROI
○ Regulation & compliance



Business objectives

How can this ML project increase profits directly or indirectly?

● Directly: increasing sales (ads, conversion rates), cutting costs


● Indirectly: increasing customer satisfaction, increasing time spent on a website


ML <-> business: can be tricky

ML model gives customers more personalized solutions -> customers’ problems solved faster -> customers happier -> customers spending more money? Or customers spending less money?


ML <-> business: mapping

● Baselines
○ Existing solutions, simple solutions, human experts, competitors’ solutions, etc.


ML <-> business: mapping

● Baselines
● Usefulness threshold
○ Self-driving needs human-level performance. Predictive texting doesn’t.



ML <-> business: mapping

● Baselines
● Usefulness threshold
● False negatives vs. false positives
○ Covid screening: no false negative (patients with covid shouldn’t be classified as no covid)
○ Fingerprint unlocking: no false positive (unauthorized people shouldn’t be given access)


ML <-> business: mapping

● Baselines
● Usefulness threshold
● False negatives vs. false positives
● Interpretability
○ Does the ML system need to be interpretable? If yes, to whom?



ML <-> business: mapping

● Baselines
● Usefulness threshold
● False negatives vs. false positives
● Interpretability
● Confidence measurement (how confident it is about a prediction)
○ Does it need confidence measurement?
○ Is there a confidence threshold? What to do with predictions below that threshold—discard
it, loop in humans, or ask for more information from users?


ML <-> business: mapping

● Baselines
○ Existing solutions, simple solutions, human experts, competitors’ solutions, etc.
● Usefulness threshold
○ Self-driving needs human-level performance. Predictive texting doesn’t.
● False negatives vs. false positives
○ Covid screening: no false negative (patients with covid shouldn’t be classified as no covid)
○ Fingerprint unlocking: no false positive (unauthorized people shouldn’t be given access)
● Interpretability
○ Does it need to be interpretable? If yes, to whom?
● Confidence measurement (how confident it is about a prediction)
○ Does it need confidence measurement?
○ Is there a confidence threshold? What to do with predictions below that threshold—discard
it, loop in humans, or ask for more information from users?


Constraints: time & budget

● Time
○ Rule of thumb: 20% time to get initial working system, 80% on iterative development
● Budget
○ Data, resources, talent

Time/budget tradeoffs

● Use more (powerful) machines


● Hire more people to label data faster
● Run more experiments in parallel
● Buy existing solutions

Constraints: privacy

● Annotation
○ Can data be shipped outside organizations for annotation?
● Storage
○ What kind of data are you allowed to store? How long can you store it?
● Third-party solutions
○ Can you share your data with a 3rd party (e.g. managed service)?
● Regulations
○ What regulations do you have to conform to?


Technical constraints

● Competitors
● Legacy systems


Four phases of ML adoption


Phase 1: Before ML

“If you think that machine learning will give you a 100% boost, then a
heuristic will get you 50% of the way there.”

Martin Zinkevich, Google


(Screenshot: what Mark Zuckerberg’s News Feed looked like in 2006. Source: https://newsfeed.org/what-mark-zuckerbergs-news-feed-looked-like-in-2006/)

Phase 2: Simplest ML models

Start with a simple model that gives visibility into its workings, in order to:

● validate hypothesis
● validate pipeline


Phase 3: Optimizing simple models

● Different objective functions


● Feature engineering
● More data
● Ensembling


Phase 4: Complex ML models


2. Decoupling objectives


Decoupling objectives

Possible high-level goals when building a ranking system for newsfeed?

1. minimize the spread of misinformation


2. maximize revenue from sponsored content
3. maximize engagement

Zoom poll: which goal would you choose?


Side note: ethics of maximizing engagement

Facebook Employee Raises Powered by ‘Really Dangerous’ Algorithm That Favors Angry Posts (SFist, 2019)
The Making of a YouTube Radical (NYT, 2019)

Goal: maximize engagement

Step-by-step objectives:

1. Filter out spam


2. Filter out NSFW content
3. Rank posts by engagement: how likely users will click on them


Wholesome newsfeed

Goal: maximize users’ engagement while minimizing the spread of extreme views and misinformation

Step-by-step objectives:

1. Filter out spam


2. Filter out NSFW content
3. Filter out misinformation
4. Rank posts by quality
5. Rank posts by engagement: how likely users will click on them


Decoupling objectives

Goal: maximize users’ engagement while minimizing the spread of extreme views and misinformation

Step-by-step objectives:

1. Filter out spam


2. Filter out NSFW content
3. Filter out misinformation
4. Rank posts by quality
5. Rank posts by engagement: how likely users will click on it

How to rank posts by both quality & engagement?

Multiple objective optimization (MOO)

● Rank posts by quality


○ Predict posts’ quality
○ Minimize quality_loss: difference between predicted quality and true quality

● Rank posts by how likely users will click on it


○ Predict posts’ engagement
○ Minimize engagement_loss: difference between predicted clicks and true clicks


One model optimizing combined loss

● Rank posts by quality


○ Predict posts’ quality
○ Minimize quality_loss: difference between predicted quality and true quality

● Rank posts by how likely users will click on it


○ Predict posts’ engagement
○ Minimize engagement_loss: difference between predicted clicks and true clicks

loss = 𝛼 quality_loss + 𝛽 engagement_loss

Train one model to minimize this combined loss.
Tune 𝛼 and 𝛽 to meet your need.

Side note 1: check out Pareto optimization if you want to learn about how to choose 𝛼 and 𝛽.

One model optimizing combined loss

● Rank posts by quality


○ Predict posts’ quality
○ Minimize quality_loss: difference between predicted quality and true quality

● Rank posts by how likely users will click on it


○ Predict posts’ engagement
○ Minimize engagement_loss: difference between predicted clicks and true clicks

loss = 𝛼 quality_loss + 𝛽 engagement_loss

Train one model to minimize this combined loss

Side note 2: this is quite common, e.g. style transfer

A Neural Algorithm of Artistic Style (Gatys et al., 2015)


One model optimizing combined loss

● Rank posts by quality


○ Predict posts’ quality
○ Minimize quality_loss: difference between predicted quality and true quality

● Rank posts by how likely users will click on it


○ Predict posts’ engagement
○ Minimize engagement_loss: difference between predicted clicks and true clicks

loss = 𝛼 quality_loss + 𝛽 engagement_loss

Train one model to minimize this combined loss

⚠ Every time you want to tweak 𝛼 and 𝛽, you have to retrain your model! ⚠
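A minimal sketch (my own; the losses, data, and weights are made up, and both losses are assumed to be mean squared error) of the combined loss:

import numpy as np

def mse(pred, true):
    return float(np.mean((pred - true) ** 2))

# Hypothetical predictions from one shared model with two heads.
pred_quality, true_quality = np.array([0.7, 0.2]), np.array([0.9, 0.1])
pred_clicks, true_clicks = np.array([0.3, 0.8]), np.array([0.2, 0.9])

alpha, beta = 0.5, 0.5  # changing these changes the training objective -> retrain the single model
loss = alpha * mse(pred_quality, true_quality) + beta * mse(pred_clicks, true_clicks)
print(loss)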

Multiple models: each optimizing one objective

● Rank posts by quality


○ Predict posts’ quality
○ Minimize quality_loss: difference between predicted quality and true quality

● Rank posts by how likely users will click on it


○ Predict posts’ engagement
○ Minimize engagement_loss: difference between predicted clicks and true clicks

M_q: optimizes quality_loss
M_e: optimizes engagement_loss

Rank posts by 𝛼 M_q(post) + 𝛽 M_e(post)

Now you can tweak 𝛼 and 𝛽 without retraining models
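A minimal sketch (my own; the post names and scores are made up) of combining two frozen models' scores at ranking time:

# Hypothetical per-post scores produced by two separately trained models.
quality_score = {"post_a": 0.9, "post_b": 0.4, "post_c": 0.6}     # from M_q
engagement_score = {"post_a": 0.2, "post_b": 0.8, "post_c": 0.5}  # from M_e

def rank(posts, alpha, beta):
    # Combine the two models' scores; changing alpha/beta needs no retraining.
    return sorted(posts, key=lambda p: alpha * quality_score[p] + beta * engagement_score[p], reverse=True)

posts = ["post_a", "post_b", "post_c"]
print(rank(posts, alpha=0.7, beta=0.3))  # quality-heavy ranking
print(rank(posts, alpha=0.3, beta=0.7))  # engagement-heavy ranking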


Decouple different objectives

● Easier for training:


○ Optimizing for one objective is easier than optimizing for multiple objectives
● Easier to tweak your system:
○ E.g. rank by 𝛼 × (model optimized for quality) + 𝛽 × (model optimized for engagement)
● Easier for maintenance:
○ Different objectives might need different maintenance schedules
■ Spamming techniques evolve much faster than the way post quality is perceived
■ Spam filtering systems need updates more frequently than quality ranking systems


3. Breakout exercise


4. Data Engineering

Very basic. For details, take a database class!


Data engineering

● Data sources
● Data formats
● Data models
● Data storage engines & processing


Data sources

● User generated
● Systems generated
● Internal databases: users, inventory, customer relationships
● Third-party data


Data sources

User-generated data:
● User inputs
● Easily mal-formatted
● Need to be processed ASAP

System-generated data:
● Logs, metadata, predictions
● Easier to standardize
● OK to process periodically (unless you need to detect problems ASAP)
● Can grow very large very quickly
● Many tools to process & analyze logs: Logstash, DataDog, Logz, etc.
● OK to delete when no longer useful

Note: users’ behavioral data (clicks, time spent, etc.) is often system-generated but is considered user data.


Third-party data: creepy but fascinating

● Types of data
○ social media, income, job
● Demographic group
○ men, age 25-34, work in tech
● More available with Mobile Advertiser ID
● Useful for learning features
○ people who like A also like B

Source: onaudience.com/audience-data

The end of tracking IDs …


Or is it?

TikTok wants to keep tracking iPhone users with state-backed workaround | Ars Technica

How to store your data?

Storing your data is only interesting if you want to access it later

● Storing data: serialization
● Retrieving data: deserialization
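A minimal sketch (my own; the record is made up) of serialization and deserialization using JSON:

import json

record = {"user_id": 42, "app": "maps", "clicked": True}

serialized = json.dumps(record)    # serialization: object -> text that can be stored or transmitted
restored = json.loads(serialized)  # deserialization: text -> a usable object again

assert restored == record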


How to store your data?

Data formats are agreed-upon standards to serialize your data so that it can be transmitted & reconstructed later.

Data formats: questions to consider

● How to store multimodal data?
○ {‘image’: [[200,155,0], [255,255,255], ...], ‘label’: ‘car’, ‘id’: 1}
● Access patterns
○ How frequently will the data be accessed?
● The hardware the data will be run on
○ Complex ML models on TPU/GPU/CPU


Data formats

● JSON: text, human-readable; used everywhere
● CSV: text, row-major, human-readable; used everywhere
● Parquet: binary, column-major, not human-readable; Hadoop, Amazon Redshift
● Avro: primarily binary, not human-readable; Hadoop
● Protobuf: primarily binary, not human-readable; Google, TensorFlow (TFRecord)
● Pickle: binary, not human-readable; Python, PyTorch serialization
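A minimal sketch (my own; the tiny DataFrame is made up) of writing the same data in several of these formats with pandas; Parquet support assumes pyarrow or fastparquet is installed:

import pandas as pd

df = pd.DataFrame({"title": ["Harry Potter", "The Hobbit"], "price": [20, 30]})

df.to_csv("books.csv", index=False)         # text, row-major, human-readable
df.to_json("books.json", orient="records")  # text, human-readable
df.to_parquet("books.parquet")              # binary, column-major (needs pyarrow or fastparquet)
df.to_pickle("books.pkl")                   # binary, Python-specific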


Row-major vs. column-major

Row-major:
● stored and retrieved row by row
● good for accessing samples

Column-major:
● stored and retrieved column by column
● good for accessing features

Row-major vs. column-major: DataFrame vs. ndarray

Pandas DataFrame: column-major
● accessing a row is much slower than accessing a column

NumPy ndarray: row-major by default
● can be specified to be column-major

https://github.com/chiphuyen/just-pandas-things
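A minimal sketch (my own; the array shape is arbitrary) of the access patterns being compared:

import numpy as np
import pandas as pd

arr = np.zeros((10_000, 100))        # ndarray: row-major (C order) by default
col_major = np.asfortranarray(arr)   # explicit column-major (Fortran order) copy

df = pd.DataFrame(arr)
one_column = df[0]    # column access: cheap, each column is stored contiguously
one_row = df.iloc[0]  # row access: slower, it has to gather one value from every column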

Text vs. binary formats

Text files:
● Examples: CSV, JSON
● Pros: human readable
● Storing the number 1000000: 7 characters -> 7 bytes

Binary files:
● Example: Parquet
● Pros: compact
● Storing the number 1000000: only 4 bytes if stored as int32
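A minimal sketch (my own) that checks the byte counts from this comparison with Python's struct module:

import struct

as_text = str(1000000).encode("ascii")  # b'1000000'
as_int32 = struct.pack("<i", 1000000)   # 4-byte little-endian int32

print(len(as_text))   # 7 bytes
print(len(as_int32))  # 4 bytes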


Data models

● Describe how data is represented


● Two main paradigms:
○ Relational model
○ NoSQL


Relational model (est. 1970)

● Similar to the SQL model
● Formats: CSV, Parquet

(Diagram: a relation is a table with a heading of columns; both the tuples (rows) and the columns are unordered.)

Relational model: normalization

What if we change “Banana Press” to “Pineapple Press”?

Original Book relation:

Title            Author          Format     Publisher     Country  Price
Harry Potter     J.K. Rowling    Paperback  Banana Press  UK       $20
Harry Potter     J.K. Rowling    E-book     Banana Press  UK       $10
Sherlock Holmes  Conan Doyle     Paperback  Guava Press   US       $30
The Hobbit       J.R.R. Tolkien  Paperback  Banana Press  US       $30
Sherlock Holmes  Conan Doyle     Paperback  Guava Press   US       $15

Relational model: normalization

Updated Book relation:

Title            Author          Format     Publisher ID  Price
Harry Potter     J.K. Rowling    Paperback  1             $20
Harry Potter     J.K. Rowling    E-book     1             $10
Sherlock Holmes  Conan Doyle     Paperback  2             $30
The Hobbit       J.R.R. Tolkien  Paperback  1             $30
Sherlock Holmes  Conan Doyle     Paperback  2             $15

Publisher relation:

Publisher ID  Publisher     Country
1             Banana Press  UK
2             Guava Press   US

Relational model: normalization

Pros:
● Fewer mistakes (standardized spelling)
● Easier to update
● Easier localization

Cons:
● Slow to join across multiple large tables
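A minimal sketch (my own, using Python's built-in sqlite3; the schema mirrors the normalized relations above) of the single-row update and the join the cons bullet refers to:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Publisher (PublisherID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
    CREATE TABLE Book (Title TEXT, Author TEXT, Format TEXT, PublisherID INTEGER, Price REAL);
    INSERT INTO Publisher VALUES (1, 'Banana Press', 'UK'), (2, 'Guava Press', 'US');
    INSERT INTO Book VALUES ('Harry Potter', 'J.K. Rowling', 'Paperback', 1, 20);
""")

# Renaming the publisher touches one row instead of every book row.
con.execute("UPDATE Publisher SET Name = 'Pineapple Press' WHERE PublisherID = 1")

# The price of normalization: reads now need a join.
for row in con.execute("""
        SELECT Book.Title, Publisher.Name, Publisher.Country
        FROM Book JOIN Publisher ON Book.PublisherID = Publisher.PublisherID"""):
    print(row)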

Relational Model & SQL Model

● SQL model slightly differs from relational model


○ e.g. SQL tables can contain row duplicates. True relations can’t.
● SQL is a query language
○ How to specify the data that you want from a database
● SQL is declarative
○ You tell the data system what you want
○ It’s up to the system to figure out how to execute
■ Query optimization


SQL

● SQL is an essential tool for data scientists

LEARN SQL!


Problems with SQL

● What if we add a new column?


● What if we change a column type?


SQL to NoSQL

https://blog.couchbase.com/nosql-adoption-survey-surprises/

NoSQL: No SQL -> Not Only SQL

● Document model
● Graph model


NoSQL

● Document model
○ Central concept: document
○ Relationships between documents are rare
● Graph model
○ Central concept: graph (nodes & edges)
○ Relationships are the priority


Document model: example

● Book data in the document model


● Each book is a document


Graph model

(Diagram: nodes each have a type and a name, and are connected by labeled edges. The cities Palo Alto and Stanford are “within” the state California, which is “within” the country USA; the city Paris is “within” the country France. Person nodes (Kinbert Chou, Megan Leszczynski, Chloe He) are connected to cities by “born_in” and “lives_in” edges, and to each other by “coworker” and “friend” edges.)

Graph model

Query: show me everyone who was born in the USA
● Easy in graph
● Difficult in SQL

(Same graph diagram as above.)
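A minimal sketch (my own; plain dicts stand in for a graph database, the node names come from the diagram, and the specific edges are illustrative guesses) of why this query is natural in a graph model:

# Nodes: name -> type. Edges: (source, relation) -> target (simplified to one target per relation).
nodes = {"USA": "country", "France": "country", "California": "state",
         "Palo Alto": "city", "Stanford": "city", "Paris": "city",
         "Kinbert Chou": "person", "Megan Leszczynski": "person", "Chloe He": "person"}
edges = {("Palo Alto", "within"): "California", ("Stanford", "within"): "California",
         ("California", "within"): "USA", ("Paris", "within"): "France",
         ("Megan Leszczynski", "born_in"): "Palo Alto", ("Chloe He", "born_in"): "Stanford"}

def country_of(place):
    # Walk "within" edges upward until a country node (or a dead end) is reached.
    while place is not None and nodes.get(place) != "country":
        place = edges.get((place, "within"))
    return place

born_in_usa = [name for name, kind in nodes.items()
               if kind == "person" and country_of(edges.get((name, "born_in"))) == "USA"]
print(born_in_usa)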


Structured vs. unstructured data

Structured:
● Schema clearly defined
● Easy to search and analyze
● Can only handle data with a specific schema
● Schema changes will cause a lot of trouble
● Stored in data warehouses

Unstructured:
● Schema: whatever
● Fast arrival (e.g. no need to clean up first)
● Can handle data from any source
● No need to worry about schema changes
● Stored in data lakes

Structured vs. unstructured data

Structured: the structure is assumed at write time. Unstructured: the structure is assumed at read time.
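A minimal sketch (my own; the table, fields, and JSON lines are made up) of the write-time vs. read-time difference:

import json
import sqlite3

# Schema on write (structured): rows must fit the schema declared up front.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rides (ride_id INTEGER, price REAL)")
con.execute("INSERT INTO rides VALUES (?, ?)", (1, 20.4))

# Schema on read (unstructured): store whatever arrives, impose structure only when reading.
raw_lines = ['{"ride_id": 2, "price": 13.0}', '{"ride_id": 3, "tip": 2.5}']
records = [json.loads(line) for line in raw_lines]
prices = [r.get("price") for r in records]  # decide at read time how to handle missing fields
print(prices)  # [13.0, None]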


Data Storage Engines & Processing

Databases are optimized either for transactional processing or for analytical processing.

OnLine Transaction Processing (OLTP)

● Transactions: tweeting, ordering a Lyft, uploading a new model, etc.


● Operations:
○ Insert when generated
○ Occasional update/delete


OnLine Transaction Processing

● Transactions: tweeting, ordering a Lyft, uploading a new model, etc.


● Operations:
○ Inserted when generated
○ Occasional update/delete
● Requirements
○ Low latency
○ High availability


OnLine Transaction Processing

● Transactions: tweeting, ordering a Lyft, uploading a new model, etc.


● Operations:
○ Inserted when generated
○ Occasional update/delete
● Requirements
○ Low latency
○ High availability
○ ACID not necessary (ACID: Atomicity, Consistency, Isolation, Durability)
■ Atomicity: all the steps in a transaction fail or succeed as a group
● If payment fails, don’t assign a driver
■ Isolation: concurrent transactions happen as if sequential
● Don’t assign the same driver to two different requests that happen at the same time
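A minimal sketch (my own, using sqlite3; the table and column names are made up) of the atomicity bullet, where a payment and a driver assignment succeed or fail as one unit:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE payments (ride_id INTEGER, amount REAL);
    CREATE TABLE assignments (ride_id INTEGER, driver_id INTEGER);
""")

try:
    with con:  # one transaction: commits on success, rolls back on any exception
        con.execute("INSERT INTO payments VALUES (?, ?)", (10, 20.4))
        raise RuntimeError("driver assignment failed")  # simulate a later step failing
except RuntimeError:
    pass

# The payment insert was rolled back along with the failed step: nothing was written.
print(con.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 0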


OnLine Transaction Processing

● Transactions: tweeting, ordering a Lyft, uploading a new model, etc.


● Operations:
○ Inserted when generated
○ Occasional update/delete
● Requirements
○ Low latency
○ High availability
● Typically row-major

INSERT INTO RideTable(RideID, Username, DriverID, City, Month, Price)
VALUES ('10', 'memelord', '3932839', 'Stanford', 'July', '20.4');


OnLine Analytical Processing (OLAP)

● How to get aggregated information from a large amount of data?


○ e.g. what’s the average ride price last month for riders at Stanford?
● Operations:
○ Mostly SELECT


OnLine Analytical Processing

● Analytical queries: aggregated information from a large amount of data?


○ e.g. what’s the average ride price last month for riders at Stanford?
● Operations:
○ Mostly SELECT
● Requirements:
○ Can handle complex queries on large volumes of data
○ Okay response time (seconds, minutes, even hours)


OnLine Analytical Processing

● Analytical queries: aggregated information from a large amount of data?


○ e.g. what’s the average ride price last month for riders at Stanford?
● Operations:
○ Mostly SELECT
● Requirements:
○ Can handle complex queries on large volumes of data
○ Okay response time (seconds, minutes, even hours)
● Typically column-major

SELECT AVG(Price)
FROM RideTable
WHERE City = 'Stanford' AND Month = 'July';
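A minimal sketch (my own; the DataFrame is made up) of the same analytical query over an in-memory, column-oriented layout with pandas:

import pandas as pd

rides = pd.DataFrame({
    "City": ["Stanford", "Stanford", "Berkeley"],
    "Month": ["July", "July", "July"],
    "Price": [20.4, 15.0, 30.0],
})

avg_price = rides.loc[(rides.City == "Stanford") & (rides.Month == "July"), "Price"].mean()
print(avg_price)  # 17.7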


OLTP & OLAP are outdated terms


Decoupling storage & processing

● OLTP & OLAP: how data is stored is also how it’s processed
○ Same data being stored in multiple databases
○ Each uses a different processing engine for different query types
● New paradigm: storage is decoupled from processing
○ Data can be stored in the same place
○ A processing layer on top that can be optimized for different query types


Decoupling storage & processing

https://hevodata.com/blog/snowflake-architecture-cloud-data-warehouse/


ETL (Extract, Transform, Load)

(Diagram: OLTP -> Extract, Transform, Load -> OLAP)

Transform: the meaty part
● cleaning, validating, transposing, deriving values, joining from multiple sources, deduplicating, splitting, aggregating, etc.
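A minimal sketch (my own; the columns, values, and output file are made up, and Parquet output assumes pyarrow or fastparquet) of a tiny transform step in pandas:

import pandas as pd

# Extract: pull raw rows from the transactional side (a hard-coded stand-in here).
raw = pd.DataFrame({
    "ride_id": [1, 1, 2],
    "city": ["stanford", "stanford", "Berkeley "],
    "price": ["20.4", "20.4", "30"],
})

# Transform: deduplicate, clean values, derive a new column.
clean = (raw.drop_duplicates()
            .assign(city=lambda d: d.city.str.strip().str.title(),
                    price=lambda d: d.price.astype(float))
            .assign(price_with_tip=lambda d: d.price * 1.15))

# Load: write the result where the analytical side can read it.
clean.to_parquet("rides_clean.parquet")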


Often done in batch

https://commons.wikimedia.org/wiki/File:Extract,_Transform,_Load_Data_Flow_Diagram.svg

ETL -> ELT

Structured -> unstructured (want more flexibility) -> structured (tools & infra standardized)

ETL -> ELT -> ETL

THANK YOU !
