Week 8:
Machine Learning (in a Nutshell)
Winter term 2020 / 2021
Chair for Computational Social Science and Humanities
Markus Strohmaier, Florian Lemmerich, and Tobias Schumacher
Where are we?
Sources and Resources
➢ Sara Hajian (Algorithmic Bias, KDD Tutorial 2016)
➢ S. Bird et al.: Fairness-Aware Machine Learning in Practice
https://sites.google.com/view/fairness-tutorial
➢ S. Barocas and M. Hardt: Fair Machine Learning https://vimeo.com/248490141
➢ S. Hajian et al.: Discrimination-Aware Machine Learning
https://www.youtube.com/watch?v=mJcWrfoGup8
➢ http://www.mlandthelaw.org/slides/hajian.pdf
Agenda
➢ Machine Learning in a Nutshell
▪ Overview
▪ Classification
▪ Clustering
▪ Where does ML work?
8.1 Overview
Machine Learning
“Machine learning is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without
using explicit instructions, relying on patterns and inference instead.“
[Wikipedia]
➢ Subset of Artificial Intelligence
➢ Enhances AI systems by extracting patterns from raw data instead of relying only on hard-coded rules
➢ Application-oriented
➢ “Statistics with theory and assumptions dropped” (?)
Comparison of Machine Learning and Data Mining
➢ Strong overlap
➢ Data Mining: Emphasis on (interactive) exploration and discovery
➢ ML: Emphasis on prediction of new events
➢ Data Mining uses Machine Learning methods and vice versa
➢ “Data science” as an umbrella term?
ML/DM tasks
➢ Classification
➢ Regression (=Numeric Prediction)
➢ Clustering
➢ Association Rule Mining
➢ …
➢ On data types
▪ Tabular data
▪ Text
▪ Networks
▪ Images
▪ Sequences
▪ Time Series
▪ …
8.2 Classification (+Regression)
Classification problems
➢ Given:
▪ A set of objects (data points)
▪ For each object, a set of features (attribute values) (x1, …, xd)
▪ For each object, a class (label) c ∈ C = {c1, …, ck}
➔ a “supervised” technique
➢ Goal:
Find the correct class for new objects (given only the features)
➢ Difference to clustering:
▪ Classification:
• Classes are known a priori
• There is training data with known labels
▪ Clustering: clusters (= classes) have to be identified
Classification Process
1. Training phase: a classification algorithm learns a classifier (model) from labeled training data.

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
if rank = ‘Professor’ or years > 6 then tenured = ‘yes’

2. Application phase (test phase): the classifier predicts the label of unknown data, e.g.,
(Jeff, Professor, 4) ➔ Tenured? ➔ yes
Evaluation scenario
                 Ground truth
                 positive         negative
Pred. positive   True positive    False positive
Pred. negative   False negative   True negative

➢ Accuracy: (True positives + True negatives) / all instances
➢ Precision: True positives / (True positives + False positives)
➢ Recall: True positives / (True positives + False negatives)
➢ F1-measure: (2 · Precision · Recall) / (Precision + Recall)
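A minimal Python sketch computing these metrics for a binary task (the labels below are illustrative, not course data):

```python
# Confusion-matrix-based metrics for a binary classification task.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # ground truth (illustrative)
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]   # classifier output (illustrative)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```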
Beyond Logistic Regression
➢ We can solve classification problems with Logistic Regression
➢ Why do we need anything else?
➢ Logistic Regression makes certain assumptions:
▪ Independence of features ➔ complex relationships between variables can’t be captured
▪ A linear classification boundary ➔ other algorithms might fit better
▪ Only two possible outcomes (classes)
➢ These assumptions do not hold for all datasets
Algorithms
➢ Naive Bayes
➢ Nearest Neighbor
➢ Decision Trees
➢ SVM
➢ Neural Networks
➢ Random Forest
➢ Boosting
Naive Bayes
➢ Probability-based classifier:
Assign object $o = (x_1, \dots, x_d)$ to the class $c_j$ that maximizes $P(c_j \mid o)$:
$\arg\max_{c_j \in C} P(c_j \mid o)$
➢ $P(c_j \mid o)$ is unknown!
➢ Use Bayes’ theorem:
$\arg\max_{c_j \in C} P(c_j \mid o) = \arg\max_{c_j \in C} \frac{P(o \mid c_j)\,P(c_j)}{P(o)} = \arg\max_{c_j \in C} P(o \mid c_j)\,P(c_j)$
➢ We can determine $P(c_j)$ from the training data by simple counting
➢ To compute $P(o \mid c_j)$, assume that all features are independent:
$P(o \mid c_j) = P(x_1 \mid c_j) \cdot \ldots \cdot P(x_d \mid c_j) = \prod_{i=1}^{d} P(x_i \mid c_j)$
➢ Overall decision rule for the Naïve Bayes classifier:
$\arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{d} P(x_i \mid c_j)$
Example Naive Bayes
Training data:
Color  Age    Class
Red    Old    Good
White  Old    Good
White  Young  Good
Red    Old    Bad
White  Young  Bad

New instance: (Red, Young) ➔ ?

Compute the values of the decision rule for both classes:
Class “Good”: P(Good) · P(Red | Good) · P(Young | Good) = 3/5 · 1/3 · 1/3 = 1/15
Class “Bad”: P(Bad) · P(Red | Bad) · P(Young | Bad) = 2/5 · 1/2 · 1/2 = 1/10
Higher score for class “Bad”
➔ The new instance is classified as “Bad”
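A short sketch that reproduces this example by simple counting (exact fractions are used to match the numbers on the slide):

```python
from collections import Counter
from fractions import Fraction

# Naive Bayes by counting, reproducing the slide's example.
data = [("Red", "Old", "Good"), ("White", "Old", "Good"),
        ("White", "Young", "Good"), ("Red", "Old", "Bad"),
        ("White", "Young", "Bad")]
new = ("Red", "Young")

class_counts = Counter(row[2] for row in data)
scores = {}
for c, count in class_counts.items():
    score = Fraction(count, len(data))      # prior P(c)
    for i, value in enumerate(new):         # times P(x_i | c) per feature
        matches = sum(1 for row in data if row[2] == c and row[i] == value)
        score *= Fraction(matches, count)
    scores[c] = score

print(scores)                        # Good: 1/15, Bad: 1/10
print(max(scores, key=scores.get))   # 'Bad'
```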
Nearest Neighbor Classification
➢ Define distances between data instances
➢ Specify a value k
➢ To classify a new data instance:
▪ Search the k most similar instances (k-nearest-neighbors)
▪ Select the majority class among those neighbors (“voting”)
▪ Option: weight the votes by distance
(close neighbors: high weight; far-away neighbors: low weight)
[Illustration: the predicted class of the same point can differ between k = 5 and k = 10]
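A minimal k-NN sketch with Euclidean distance and an unweighted majority vote (the data points are illustrative):

```python
import math
from collections import Counter

# k-nearest-neighbor classification: find the k closest training
# points and let them vote on the class.
def knn_predict(train_points, train_labels, query, k):
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (5, 5), (6, 5), (6, 6)]
labels = ["a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2), k=3))   # 'a'
```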
Decision Trees
➢ Each inner node is an attribute
➢ Each edge is a value (range) of its source node
➢ Each leaf node is a class label
➢ Learning: greedy, top-to-bottom
➢ Prediction: follow the tree from top to bottom based on the instance’s values

Training data (as before):
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Resulting tree:
Rank?
├─ Professor ➔ yes
└─ Not professor ➔ Years?
   ├─ < 7 ➔ no
   └─ ≥ 7 ➔ yes
Decision Tree Induction
➢ Top-down, recursive
➢ At each node: try to separate the classes
➢ Formally: use information gain, where $T$ is the set of instances at a node and the $T_i$ are the subsets induced by the $m$ values of attribute $A$:
$\mathrm{entropy}(T) = -\sum_{i=1}^{k} p_i \log p_i$
$\mathrm{information\_gain}(T, A) = \mathrm{entropy}(T) - \sum_{i=1}^{m} \frac{|T_i|}{|T|}\,\mathrm{entropy}(T_i)$
➢ Recurse until there is no more data or all instances belong to one class
➢ Many variations:
▪ Only binary splits
▪ Alternatives to information gain
▪ Pruning: limit the size of the decision tree to avoid overfitting
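A small sketch of the split criterion in Python, applied to the tenure example (the split shown is rank = ‘Professor’ vs. the rest):

```python
import math
from collections import Counter

# Entropy of a list of class labels, and the information gain of a split.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - remainder

tenured = ["no", "yes", "yes", "yes", "no", "no"]
# Splitting on rank = 'Professor' separates Bill from the other five:
partitions = [["yes"], ["no", "yes", "yes", "no", "no"]]
print(information_gain(tenured, partitions))   # ~0.19 bits
```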
Support Vector Machines
➢ Each data point is a point in d-dimensional space
➢ Find a separating hyperplane that maximizes the distance (margin) to the closest instances
➢ The base form requires linearly separable data
➢ Extensions handle non-separable data as well:
▪ Kernel methods: transformation into a higher-dimensional space
▪ Slack variables: add a penalty term for each misclassified instance
Idea: Kernels and SVM
[Illustration: 1-dimensional base data x that is not linearly separable becomes separable in a higher-dimensional space after mapping x ➔ (x, x²)]
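A tiny sketch of this idea, assuming the feature map x ➔ (x, x²) from the illustration:

```python
# Points on a line: the 'outer' class surrounds the 'inner' class,
# so no single threshold on x separates them.
outer = [-3, -2, 2, 3]
inner = [-1, 0, 1]

def feature_map(x):
    return (x, x ** 2)          # map into a higher-dimensional space

# In the transformed space the horizontal line x2 = 2 separates them:
print([feature_map(x) for x in outer])   # second coordinate >= 4
print([feature_map(x) for x in inner])   # second coordinate <= 1
```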
Neural Networks (“Deep Learning”)
➢ Connect many perceptrons with each other
➢ Based on Backpropagation learning (gradient descent)
[Illustration: a feed-forward network with inputs x1, x2, x3 (input layer), two hidden layers, and one output y (output layer)]
➢ Huge variety of architectures
➢ Requires large amounts of training data
➢ “Black box”
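A minimal forward pass through such a network, sketched with NumPy (the weights are random for illustration; training would adjust them via backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])                 # inputs x1, x2, x3

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer 1
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)  # hidden layer 2
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer

def relu(z):
    return np.maximum(z, 0)

h1 = relu(W1 @ x + b1)                         # each layer: weights, bias,
h2 = relu(W2 @ h1 + b2)                        # then a nonlinear activation
y = 1 / (1 + np.exp(-(W3 @ h2 + b3)))          # sigmoid output in (0, 1)
print(y)
```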
No-Free-Lunch-theorem (Machine Learning edition)
➢ Given many arbitrary datasets
▪ No assumptions about the data
▪ Labels assigned randomly
➢ Then all algorithms are equally good!
➔ In general, there is no “best algorithm”
➢ Algorithms differ in how reasonable their assumptions are, e.g.:
▪ Linearity of features
▪ “Similar” objects likely have a similar class, …
Idea of Ensemble Learning
➢ Different classifiers make different errors
➢ Try to combine classifiers to improve classification accuracy!
➢ Definition Ensemble Learning:
“An ensemble of classifiers is a set of classifiers whose individual decisions are combined
in some way to classify new examples” (Dietterich, 2000)
[Illustration: base learners C1, C2, C3, …, Cn each make a single prediction (e.g., class = b); a combining mechanism merges them into the final prediction]
Example: Ensemble Learning
➢ Can this work?
➢ Example with 3 classifiers on a test set i1, …, i5:

              i1  i2  i3  i4  i5   Overall accuracy
True class    +   +   +   -   -
Classifier 1  +   +   +   +   -    0.8
Classifier 2  +   +   -   -   -    0.8
Classifier 3  -   -   +   -   -    0.6
Majority vote +   +   +   -   -    1.0

➢ The ensemble classifies better than each individual classifier!
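A sketch reproducing the majority vote from this table:

```python
from collections import Counter

true_class = ["+", "+", "+", "-", "-"]
predictions = [
    ["+", "+", "+", "+", "-"],   # classifier 1 (accuracy 0.8)
    ["+", "+", "-", "-", "-"],   # classifier 2 (accuracy 0.8)
    ["-", "-", "+", "-", "-"],   # classifier 3 (accuracy 0.6)
]

# Majority vote per test instance (column-wise over the classifiers).
ensemble = [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
accuracy = sum(e == t for e, t in zip(ensemble, true_class)) / len(true_class)
print(ensemble, accuracy)        # ['+', '+', '+', '-', '-'] 1.0
```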
Random Forests
➢ Ensemble method
➢ Build many slightly different trees
▪ Subset of instances for each tree (bootstrap sampling)
▪ Choose from a subset of attributes at each node
➢ Collect the predictions of all trees
➢ The majority decides
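A usage sketch with scikit-learn (assuming the library is available; the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 trees, each trained on a bootstrap sample of the instances and
# restricted to a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:3]))
```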
Boosting
➢ Train a simple classifier
➢ Identify misclassified instances of this classifier
➢ Increase the weight of these instances
➢ Iterate
➢ Overall classifier: Combination of all previously trained models
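A conceptual sketch of one boosting round with an AdaBoost-style weight update; `train_weak_learner` is a hypothetical stand-in for any simple classifier, not a library function:

```python
import math

def boosting_round(X, y, weights, train_weak_learner):
    model = train_weak_learner(X, y, weights)      # 1. train a simple classifier
    preds = [model(x) for x in X]
    error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
    alpha = 0.5 * math.log((1 - error) / error)    # vote weight (assumes 0 < error < 0.5)
    # 2./3. increase weights of misclassified instances, decrease the rest:
    new = [w * math.exp(alpha if p != t else -alpha)
           for w, p, t in zip(weights, preds, y)]
    total = sum(new)
    return model, alpha, [w / total for w in new]  # 4. iterate with new weights
```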
Task Variation: Multiclass Classification
➢ More than two possible classes (outcomes)
➢ What to do?
➢ Some algorithms extend to this naturally
▪ Decision Trees
▪ Nearest Neighbor
▪ Naïve Bayes
➢ Others?
▪ One-vs-all: Train a classifier for each class – is it this class or something else?
▪ One-vs-one: Train a classifier for each pair of classes – which one is more likely?
Then, do a majority voting.
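A usage sketch of both schemes with scikit-learn (assuming the library is available), on the 3-class Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One-vs-all: one binary classifier per class;
# One-vs-one: one classifier per pair of classes, then majority voting.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```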
Other Variations
➢ Multilabel classification
➢ Cost-sensitive classification
➢ Unbiased classification
Florian Lemmerich
30 Social Data Science
Regression
➢ We already learned most of the basics in this course
➢ For most classification approaches, there is also a regression counterpart
➢ Examples:
▪ Regression trees
▪ Support Vector Regression (SVR)
▪ Random forest regressor
▪ Boosted trees
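A usage sketch of two of these with scikit-learn (assuming the library is available; the synthetic data is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # regression tree
svr = SVR(kernel="rbf").fit(X, y)                     # support vector regression
print(tree.predict(X[:2]), svr.predict(X[:2]))
```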
8.3 Clustering
Clustering: Goal and Definition
➢ Identification of a finite set of clusters in the data
(= categories, “classes”, groups)
➢ Objects in the same cluster should be as similar as possible
➢ Objects in different clusters should be as dissimilar as possible
➢ “Unsupervised learning” ➔ no groups/labels are given in advance
Example domains: gene sequences, books/documents, language dialects, websites, political opinions
What does “Similar“ Mean?
Similarity?
Distance functions & similarity functions
➢ Formalization of distance and similarity:
▪ Define functions dist(o1, o2) or sim(o1, o2) (distance is more common)
▪ Small distance ⟺ similar objects
➢ A “good” distance measure is crucial, but depends on the application
➢ Common distance measures:
▪ Euclidean distance: $dist_{Eucl}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$
▪ Manhattan distance: $dist_{Man}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
▪ Maximum metric: $dist_{Max}(x, y) = \max\{\,|x_i - y_i| \mid 1 \le i \le d\,\}$
▪ Hamming distance: number of differing attribute values
E.g., x = (young, teacher, yes); y = (old, student, yes)
➔ $dist_{Hamming}(x, y) = 2$
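The four measures as a short Python sketch:

```python
import math

def dist_eucl(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dist_man(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def dist_max(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def dist_hamming(x, y):
    return sum(a != b for a, b in zip(x, y))   # count differing attributes

print(dist_hamming(("young", "teacher", "yes"), ("old", "student", "yes")))  # 2
```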
Example: Distance metrics
For two points A and B whose coordinates differ by 2 and by 3 (see illustration):
➢ distEucl(A, B) = √13
➢ distMan(A, B) = 5
➢ distMax(A, B) = 3
K-means
1. Choose the number of clusters k
2. Pick k initial centroids (cluster centers) randomly
3. Assign each data point to the nearest centroid
4. Compute the new centroid of each cluster as the mean (vector!) of all its points
5. If any assignment changed, go to step 3
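A minimal k-means sketch for 2-D points following exactly these steps (the data is illustrative):

```python
import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)        # step 2: random initial centroids
    while True:
        clusters = [[] for _ in range(k)]       # step 3: assign to nearest centroid
        for x, y in points:
            j = min(range(k), key=lambda i: (x - centroids[i][0]) ** 2
                                            + (y - centroids[i][1]) ** 2)
            clusters[j].append((x, y))
        new = [(sum(x for x, _ in c) / len(c),  # step 4: mean of each cluster
                sum(y for _, y in c) / len(c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                    # step 5: stop when nothing changes
            return clusters, centroids
        centroids = new

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2)[1])
```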
K-Means: Example
[Illustration: four scatter plots of one k-means run, alternating “assign data points to clusters” and “compute centroids” until convergence]
➢ Animations:
https://www.youtube.com/watch?v=-lu8Ku5Cw9E
https://www.youtube.com/watch?v=BVFG7fd1H30
Density-based Clustering (DB-Scan)
➢ Idea:
▪ Clusters are dense regions in d-dimensional space
▪ Clusters are separated by sparse regions
➢ Requirements for a cluster:
▪ Define neighborhoods (e.g., all points within a given radius)
▪ Each data point in the cluster has at least n neighbors
▪ Data points in one cluster are connected by “dense points”
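A usage sketch with scikit-learn's DBSCAN implementation (assuming the library is available; `eps` is the neighborhood radius, `min_samples` the required number of neighbors):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]])
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)   # two dense clusters; the isolated point gets label -1 (noise)
```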
Hierarchical Clustering
➢ Goal: A hierarchy of clusters
➢ Construction (bottom up): join the two closest existing clusters
➢ Result: dendrogram
[Illustration: points in the plane are merged bottom-up; the order and distances of the merges form the dendrogram]
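A usage sketch with SciPy (assuming scipy and matplotlib are available); `method="single"` repeatedly joins the two closest existing clusters, as described above:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[1, 1], [1.5, 1], [5, 5], [5, 6], [9, 9]]
Z = linkage(points, method="single")   # bottom-up merging of closest clusters
dendrogram(Z)                          # visualize the cluster hierarchy
plt.show()
```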
Example Dendrogram
http://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png
8.4 Where does ML work?
Why now?
Why is there so much “snake oil”, i.e., absurd or even harmful applications of AI/data science?
1. Some technologies have made remarkable progress: content identification
(reverse image search), facial recognition, speech-to-text, medical diagnosis,
translation.
2. Companies advertise algorithms/AI as a solution to all problems.
https://governanceai.github.io/US-Public-Opinion-Report-Jan-2019/us_public_opinion_report_jan_2019.pdf
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf
What determines where “AI” will work?
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf
Can social outcomes be predicted?
The Fragile Families Study is an ongoing study that follows a large group of young
people throughout their lives. Since the late 1990s, a cohort of nearly 5,000 children
from the US has been tracked and repeatedly surveyed.
The aim is to understand how family structure relates to the children’s outcomes.
A research team led by Matt Salganik issued a challenge to researchers: can
you predict social outcomes?
https://fragilefamilies.princeton.edu/about
https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf