CSCS 460 – Machine Learning
Faizad Ullah
Traditional Computer Science
Tasks like:
Play an audio/video file
Display a text file on screen
Perform a mathematical operation on two numbers
Sort an array of numbers using Insertion Sort
Search for a string in a text file
…
Data + Program → Output
Problems that Traditional CS Can’t Handle
Tumor? Y/N
Price?
What was said?
Summarize text
Data + Output → Program?
Machine Learning
Regression
Classification
Traditional CS: Data + Program → Output
Machine Learning: Data + Output → Program
What is Machine Learning?
Formally:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell, 1997)
Informally:
Algorithms that improve on some task with experience.
To train a classifier, we need labelled data (called a dataset)
Machine Learning Pipeline
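As a concrete illustration of the typical pipeline stages (data, train/test split, training, evaluation), here is a minimal sketch. The scikit-learn library, the Iris dataset, and the logistic regression model are illustrative assumptions, not part of the slides.

```python
# Minimal ML pipeline sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                  # labelled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)           # train/test split

model = LogisticRegression(max_iter=1000)          # pick a hypothesis class
model.fit(X_train, y_train)                        # train on labelled data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on held-out data
```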
Data – Big, Big,… data!
How do we obtain these massive datasets to train our Machine Learning models?
From real interactions, e.g., call centers
Expert annotators, e.g., hired teams of annotators
Crowdsourcing
reCAPTCHA tagging
Task-Label Relationship
Labels are dictated by the task to be performed.
Example: Speech Technologies
What was said? → Speech Recognition
Who said it? → Speaker Recognition
Was it John Doe? → Speaker Verification
Did it mention "hey Google"? → Keyword Detection
What's the language? → Language Identification
Is the language native for the speaker?
What is their height?
What is the age of the speaker?
What is the emotional state?
What was the sentiment?
Is the voice fake?
Task-Label Relationship
Example: Text Technologies
Who wrote it?
Summary of what was written?
Was it plagiarized?
What was the intent?
What language is this?
Is the language native for the speaker?
What is the author's literacy level?
What is the topic of this document?
What is the emotional state?
What was the sentiment?
Can we fake this writing style?
Challenges of ML - Explainability
A classifier can learn to classify on the basis of features that humans would not consider meaningful
If all dogs in the training data wear a collar while no cat does, the model simply learns to separate based on the collar
If all horse images carry a copyright notice, the model simply learns to recognize horses by the copyright notice
Explainable ML: the results should be understandable by humans
As opposed to a black-box system
Challenges of ML – Fairness
AI tends to reflect the biases of society
Human taggers who mark a recording as misinformation based on accent or gender
Court decisions in some countries that make a rich person's acquittal more likely
Automated standardized testing in the US could yield unfavorable results for certain demographic groups
AI plays a decisive role in hiring decisions, with up to 72% of resumes in the US never being viewed by a human (automation bias)
Decisions on immigration, bank loans, credit history checks, criminal profiling
ML in Low-resource settings
Problems where large datasets and tools are not available
Natural Language Processing and Speech
Pakistan has 71 languages
We barely have speech recognition capabilities for Urdu!
Types of Learning
Supervised
The outcome is provided along with the data.
Unsupervised
The outcome is NOT provided along with the data.
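A short sketch of the distinction, assuming scikit-learn (the dataset and algorithms are illustrative choices, not from the slides): a classifier is given the outcome y, while a clustering algorithm only sees X.

```python
# Supervised vs. unsupervised on the same data (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier().fit(X, y)     # supervised: outcome y is provided
km = KMeans(n_clusters=3, n_init=10).fit(X)  # unsupervised: only X is provided

print(clf.predict(X[:2]))  # predicted labels
print(km.labels_[:2])      # discovered cluster assignments
```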
Supervised Learning
What does a classifier see?
• Features
[Slide exercise: list the features that distinguish day images from night images]
Day vs. Night Classifier
Unsupervised Learning
Supervised Learning Setup
Feature Space: Tabular Data
Features/Dimensions: Height (inches), Weight (kgs), B.P. Sys, B.P. Dia. Label/Class/Category: Heart disease. A record is a 4-dimensional feature vector.

Training Data/Training Split:
Height (inches) | Weight (kgs) | B.P. Sys | B.P. Dia | Heart disease
62 | 70 | 120 | 80 | No
72 | 90 | 110 | 70 | No
74 | 80 | 130 | 70 | No
65 | 120 | 150 | 90 | Yes
67 | 100 | 140 | 85 | Yes
64 | 110 | 130 | 90 | No
69 | 150 | 170 | 100 | Yes

Testing Data/Testing Split:
66 | 125 | 145 | 90 | ?
74 | 67 | 110 | 60 | ?

As labels are discrete, this is a classification task.
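A minimal sketch of this classification task in Python, assuming scikit-learn; the decision tree is an illustrative model choice, not the slides' method.

```python
# Heart-disease table as a classification problem (illustrative sketch).
from sklearn.tree import DecisionTreeClassifier

# Training split: [height, weight, bp_sys, bp_dia] -> discrete label
X_train = [[62, 70, 120, 80], [72, 90, 110, 70], [74, 80, 130, 70],
           [65, 120, 150, 90], [67, 100, 140, 85], [64, 110, 130, 90],
           [69, 150, 170, 100]]
y_train = ["No", "No", "No", "Yes", "Yes", "No", "Yes"]

# Testing split: the "?" rows whose labels we want to predict
X_test = [[66, 125, 145, 90], [74, 67, 110, 60]]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict(X_test))  # predicted labels for the "?" rows
```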
Feature Space: Tabular Data
Features/Dimensions: Height (inches), Weight (kgs), B.P. Sys, B.P. Dia. Label: Cholesterol Level. A record is a 4-dimensional feature vector.

Training Data/Training Split:
Height (inches) | Weight (kgs) | B.P. Sys | B.P. Dia | Cholesterol Level
62 | 70 | 120 | 80 | 150
72 | 90 | 110 | 70 | 165
74 | 80 | 130 | 70 | 135
65 | 120 | 150 | 90 | 210
67 | 100 | 140 | 85 | 195
64 | 110 | 130 | 90 | 125
69 | 150 | 170 | 100 | 250

Testing Data/Testing Split:
66 | 125 | 145 | 90 | ?
74 | 67 | 110 | 60 | ?

As labels are continuous, this is a regression task.
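The same table with a continuous target becomes regression; a sketch assuming scikit-learn, with linear regression as an illustrative model choice.

```python
# Cholesterol table as a regression problem (illustrative sketch).
from sklearn.linear_model import LinearRegression

X_train = [[62, 70, 120, 80], [72, 90, 110, 70], [74, 80, 130, 70],
           [65, 120, 150, 90], [67, 100, 140, 85], [64, 110, 130, 90],
           [69, 150, 170, 100]]
y_train = [150, 165, 135, 210, 195, 125, 250]  # continuous targets

X_test = [[66, 125, 145, 90], [74, 67, 110, 60]]

reg = LinearRegression().fit(X_train, y_train)
print(reg.predict(X_test))  # predicted cholesterol levels for the "?" rows
```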
Feature Space: Image Data
Images are nothing but 2D/3D arrays of color-intensity values, typically ranging $0$–$255$.
But we said a record should be 1D!
A color image is a 3D array ($Width \times Height \times Channels$): a color image has 3 channels, while a grayscale image has 1 channel.
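The usual fix for the 1D requirement is to flatten the array; a sketch assuming NumPy, with the image size chosen arbitrarily.

```python
# Flattening a W x H x C image into a 1D feature vector (illustrative sketch).
import numpy as np

image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # fake color image
feature_vector = image.reshape(-1)  # flatten: 32 * 32 * 3 = 3072 features

print(image.shape, "->", feature_vector.shape)  # (32, 32, 3) -> (3072,)
```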
Feature Space: Text Data
Suppose you are given labelled textual data in an Excel sheet:

Document# | Text | Class
Training: 1 | The Best movie best | Pos
Training: 2 | The Best best ever | Pos
Training: 3 | The Best film | Pos
Training: 4 | The Worst cast ever | Neg
Testing: 5 | The Best best best worst ever | ?

the | best | movie | ever | film | worst | cast | label
1 | 1 | 1 | 0 | 0 | 0 | 0 | 1
1 | 1 | 0 | 1 | 0 | 0 | 0 | 1
1 | 1 | 0 | 0 | 1 | 0 | 0 | 1
1 | 0 | 0 | 1 | 0 | 1 | 1 | 0
1 | 1 | 0 | 1 | 0 | 1 | 0 | ?

These are called "Binary Occurrence" features.
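These features can be computed automatically; a sketch assuming scikit-learn's CountVectorizer with binary=True (note it lowercases text and orders the vocabulary alphabetically, so the columns differ from the table above).

```python
# Binary-occurrence features for the toy corpus (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["The Best movie best", "The Best best ever",
              "The Best film", "The Worst cast ever"]
test_docs = ["The Best best best worst ever"]

vec = CountVectorizer(binary=True)       # 1 if a word occurs in the document, else 0
X_train = vec.fit_transform(train_docs)  # learn the vocabulary from the training split
X_test = vec.transform(test_docs)        # reuse the same vocabulary on the test split

print(vec.get_feature_names_out())       # ['best' 'cast' 'ever' 'film' 'movie' 'the' 'worst']
print(X_train.toarray())
```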
Rules vs. Learning
Suppose we are working on classifying emails into "spam" and "ham" (not spam)
We can write a complicated set of rules
Works well for a while
Cannot adapt well to new emails
Program could be reverse-engineered and circumvented
Learn the mapping between an email and its label using past labelled data
Can be retrained on new emails
Not easy to reverse-engineer and circumvent in all cases
Easier to plug the leaks
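To make the learning alternative concrete, here is a sketch of learning the email-to-label mapping from past labelled data; Naive Bayes and the toy emails are illustrative choices, not the slides' method.

```python
# Spam vs. ham from past labelled data (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting at noon tomorrow",
          "free money click here", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]  # toy past labelled data

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                     # learn the email -> label mapping
print(model.predict(["free prize meeting"]))  # retraining on new emails is another fit()
```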
Formalizing the Setup
$$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subseteq X \times Y$$
Where:
$D$ is the dataset
$x_i$ (or $x^i$) is the input (feature) vector of the $i$-th sample/record/instance
$X$ is the $d$-dimensional feature space ($\mathbb{R}^d$)
$Y$ is the label space
Any categorical attribute can be converted to a numerical representation.

The data points are drawn from an unknown distribution $P$:
$$(x_i, y_i) \sim P(x, y)$$
If we don't know the distribution, let's approximate it using the samples we gathered!

We want to learn a function $h \in H$ such that, for a new instance $(x, y) \sim P$,
$h(x) = y$ with high probability, or at least $h(x) \approx y$.
The new instance also has to come from the same distribution as the $(x_i, y_i)$; in plain words, don't train on dogs and ask for predictions on cats.
Training and Testing: Formally
Training: the Machine Learning algorithm takes the training data $x_1, x_2, \dots, x_n$ together with their labels/ground truth $y_1, y_2, \dots, y_n$ and produces a model $h$.
Testing: like a Traditional CS program, the model $h$ takes testing data $x \sim P$ and outputs a prediction $h(x)$.
$h(x) = y$ (ideal)
$h(x) \approx y$ (plausible)
Label Space
Binary (binary classification)
Sentiment: positive / negative
Email: spam / ham
Online transaction fraud: yes / no
Tumor: malignant / benign
$y \in \{0, 1\}$ or $y \in \{-1, 1\}$
Multi-class (multi-class classification)
Sentiment: positive / negative / neutral
Emotion: happy / sad / surprised / angry / …
Part-of-speech tag: noun / verb / adjective / adverb / …
$y \in \{0, 1, 2, \dots\}$
Real-valued (regression)
Temperature, height, age, length, weight, duration, price, …
$y \in \mathbb{R}$
Hypothesis Space
The hypothesis $h$ is sampled from a hypothesis space $H$:
$h \in H$, where $H \in \{H_D, H_R, H_{SVM}, H_{DL}, \dots\}$
$H$ can be thought of as containing families of hypotheses that share sets of assumptions, e.g.:
Support Vector Machines: $H_{SVM} = \{h_1, h_2, \dots\}$
Decision Trees: $H_D = \{h_1, h_2, \dots\}$, $h \in H_D$
Perceptron: $H_P = \{h_1, h_2, \dots\}$
Neural Networks: $H_{NN} = \{h_1, h_2, \dots\}$
…
The selection of $H$ is done manually (by the ML engineer); the selection of $h \in H$ is done automatically.
For example, for $H$ = decision trees, the $h \in H$ would be instances of decision trees of different height, arity, thresholds, etc.
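A sketch of the decision-tree example, assuming scikit-learn: each max_depth value constrains $H$ manually, while fitting picks a concrete $h$ automatically.

```python
# Different h in H for H = decision trees (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for depth in [1, 2, 5]:                                    # constrain H manually ...
    h = DecisionTreeClassifier(max_depth=depth).fit(X, y)  # ... h is found automatically
    print(depth, h.score(X, y))                            # training accuracy of each h
```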
So, how do we choose our $h$?
Randomly?
Exhaustively?
How do we evaluate $h$?
How to choose $h$?
Randomly
May not work well
Like using a random program to solve your sorting problem!
May work if $H$ is constrained enough
Exhaustively
Would be very slow!
The space $H$ is usually very large (if not infinite)
$H$ is usually chosen by ML engineers (you!) based on their experience
$h \in H$ is estimated efficiently using various optimization techniques (math alert!)
Before moving on to finding $h$, let's first evaluate the labels.
Book Reading
Murphy – Chapter 1
References
Murphy, Chapter 1
Alpaydin, Chapter 1
TM, Chapter 1
Lectures of Andrew Ng, Dr. Ali Raza, and Kilian Weinberger's "Machine Learning for Intelligent Systems" (CS4780/CS5780).
This disclaimer should serve as adequate citation.