KEMBAR78
Lecture 01 - Machine Learning Basics Revision | PDF | Machine Learning | Statistical Classification
0% found this document useful (0 votes)
35 views80 pages

Lecture 01 - Machine Learning Basics Revision

Uploaded by

ruwantikaganga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views80 pages

Lecture 01 - Machine Learning Basics Revision

Uploaded by

ruwantikaganga
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 80

CS5802 - Advanced Machine Learning

Machine Learning Basics Revision


Dr. Nisansa de Silva,
Department of Computer Science & Engineering
http://nisansads.staff.uom.lk/

1
Learning: Why?

▪ “Learning is any process by which a system improves performance


from experience.” – Herbert Simon
▪ It is the way we human do every day.
▪ Intelligence is from experience.
▪ We can not design
– It is hard to design everything (we don’t know the exact process)
– It is hard to design even one thing (limited knowledge, dynamics, inaccuracy)
– Trade-off between “explicit design with assumption” and “conceptual design
with approximation“
– It is a black box with unknown process, but could be better than a white box
with presumably known (but in fact wrong or inaccurate) process.
Herbert A. Simon
– Let data tells you more! Turing Award 1975
Nobel Prize in Economics 1978

2
Learning: When?

▪ There is no need to “learn” to calculate payroll


▪ ML is used when:
– Human expertise does not exist (navigating on Mars)
– Humans can’t explain their expertise (speech recognition)
– Solution changes in time (routing on a computer network)
– Solution needs to be adapted/ customized to particular cases (personalized medicine, user
biometrics)
– Models are based on huge amounts of data (genomics)

3
Learning: What is?

▪ Learning general models from a data of particular


examples
▪ Machine learning is programming computers to
optimize a performance criterion using example data or
past experience.
▪ Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
▪ Example in retail: Customer transactions to consumer
behavior:
– People who bought “Da Vinci Code” also bought “The Five
People You Meet in Heaven” (www.amazon.com)

4
Learning: What is?

▪ Build a model that is a good and useful


approximation to the data.
▪ Learning can be viewed as using direct or indirect
experience to approximate a chosen target function.
▪ Function approximation can be viewed as a search
through a space of hypotheses (representations of
functions) for one that best fits a set of training data.
▪ Different learning methods assume different hypothesis
spaces (representation languages) and/or employ
different search techniques.

5
What is Machine Learning?

Alan Turing's proposal in his paper "Computing Machinery


and Intelligence", in which the question "Can machines
think?" is replaced with the question "Can machines do
what we (as thinking entities) can do?“.

A field of study that gives computers the ability to learn


without being explicitly programmed
- Arthur Samuel (1959)

A computer program is said to learn from experience 𝐸


with respect to some class of tasks 𝑇 and performance
measure 𝑃 if its performance at tasks in 𝑇, as measured by
𝑃, improves with experience 𝐸.
-Tom M. Mitchell

6
What is Machine Learning?
▪ As intelligence requires knowledge, it is necessary for the
computers to acquire knowledge.
▪ A branch of artificial intelligence, concerned with the design and
development of algorithms that allow computers:
– To evolve behaviors based on empirical data.
– Be capable of the autonomous acquisition and integration of
knowledge
– Given some model structure (prior) and a performance criterion
automatically improve their performance within it through example data
or experience.
– Integrate human knowledge and empirical evidence.

▪ Knowledge (prior) and empirical evidence (samples) are two


ends of the spectrum of resources in ML.
▪ Different approaches reside in different trade-off positions.

7
What is Machine Learning?

▪ Role of Statistics: Inference from a sample


▪ Role of Computer science: Efficient algorithms to
– Solve the optimization problem
– Representing and evaluating the model for inference

▪ Automating automation
▪ Getting computers to program themselves
▪ Writing software is the bottleneck
▪ Let the data do the work instead!

8
What is Machine Learning?

▪ Traditional Programming

Data
Computer Output
Program

▪ Machine Learning

Data
Computer Program Data Computer Program
Output

9
What is Machine Learning?
Machine Learning vs Statistics
▪ It is similar to statistics...
– Both fields try to uncover patterns in data
– Both fields draw heavily on calculus, probability, and
linear algebra, and share many of the same core
algorithms

▪ But it is not statistics!


– Stats is more concerned with helping scientists and
policymakers draw good conclusions; ML is more
concerned with building autonomous agents
– Stats puts more emphasis on interpretability and
mathematical rigor; ML puts more emphasis on predictive
performance, scalability, and autonomy.

10
What is Machine Learning?

11
Motivating Example: Learning to Filter Spam

▪ Example: Spam Filtering


▪ Spam - is all email the user does not want to
receive and has not asked to receive
– T : Identify Spam Emails
– P:
▪ % of spam emails that were filtered
▪ % of ham/ (non-spam) emails that were incorrectly filtered-out
– E : a database of emails that were labelled by users

12
The Learning Process

Measuring Preprocessing Dimensionality Model Learning Model


devices reduction Testing

The Analysis
“real world” results

• Sensors • Noise filtering • Feature selection • Classification • Cross-validation


• Databases • Feature extraction • Feature projection • Regression • Bootstrap
• Normalization • Clustering
• Description

• Number of recipients
• Size of message
• Number of attachments
• Number of “re”s in the subject line
• …
Email Server

13
Defining the Learning Task
Improve on task T, with respect to performance metric P, based on experience E
T : Playing checkers
P : Percentage of games won against an arbitrary opponent
E : Playing practice games against itself

T : Recognizing hand-written words


P : Percentage of words correctly classified
E : Database of human-labeled images of handwritten words

T : Driving on four-lane highways using vision sensors


P : Average distance traveled before a human-judged error
E : A sequence of images and steering commands recorded while observing a human driver.

T : Categorize email messages as spam or legitimate.


P : Percentage of email messages correctly classified.
E : Database of emails, some with human-given labels

14
Why use Machine Learning?

▪ Develop systems that can automatically adapt and customize


themselves to individual users.
– Personalized news or mail filter

▪ Discover new knowledge from large databases (data mining).


– Market basket analysis (e.g. diapers and beer)

▪ Ability to mimic human and replace certain monotonous tasks -


which require some intelligence.
– like recognizing handwritten characters

▪ Develop systems that are too difficult/expensive to construct


manually because they require specific detailed skills or
knowledge tuned to a specific task (knowledge engineering
bottleneck).

15
Tasks that are Best Solved by using a Learning Algorithm
A classic example : It is very hard to say what makes a 2

16
Tasks that are Best Solved by using a Learning Algorithm
Some more examples
▪ Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
– Objects in real scenes

▪ Generating patterns:
– Generating images or motion sequences

▪ Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant

▪ Prediction:
– Future stock prices or currency exchange rates
– Which movies will a person like?

17
Sample Applications/ Tasks That use Machine Learning

▪ Retail: Market basket analysis, Customer ▪ Computational biology/ Bioinformatics:


relationship management (CRM) Motifs, alignment
▪ Finance: Credit scoring, fraud detection, E- ▪ Web mining: Search engines
commerce
▪ Robotics: Space exploration, Construction,
▪ Programming: Optimization, Manufacturing
troubleshooting, Debugging
▪ Knowledge Engineering: Information
▪ Medicine: Medical diagnosis extraction, Social networks
▪ Telecommunications: Quality of service ▪ [Your favorite area]
optimization

18
Training and Testing

▪ Training is the process of


making the system able to
Data Practical
learn. acquisition usage

▪ No free lunch rule: Universal Set


(Partially Observed)
– Training set and testing set
come from the same
distribution
– Need to make some
assumptions or bias

Training Set Testing Sets


(Observed) (Unobserved)

19
Loss

ML Model
(Hypothesis)
Input Predicted
(Example) Example

Update

Calculated Evaluate
Loss (Loss Function)

20
Learning

21
Statistical Inference
▪ Inductive learning
– Involves the creation of a generalized rule for all the data.
– It uses a bottom-up approach
– Every time you eat peanuts, you start to cough. You are allergic to peanuts.
– Jennifer always leaves for school at 7:00 a.m. Jennifer is always on time. Jennifer assumes, then, that if
she leaves at 7:00 a.m. for school today, she will be on time.
Specific Pattern General
Observation Recognition Conclusion

▪ Deductive learning
– Uses the already available facts and information in order to give a valid conclusion.
– It uses a top-down approach.
– All numbers ending in 0 or 5 are divisible by 5. The number 35 ends with a 5, so it must be divisible by
5.
– Red meat has iron in it, and beef is red meat. Therefore, beef has iron in it
Do/Don’t
Specific Formulate Analyze
Collect Data reject
Observation Hypothesis Data
hypothesis

22
Statistical Inference

▪ Transductive learning
– Is used in the field of statistical learning theory to refer to
predicting specific examples given specific examples from a
domain.
– Unlike induction, specific examples are used directly for
reasoning, and no generalization is required.
– This can be a more specific problem to solve than induction.
– A classic example of a transductive algorithm is the k-nearest neighbor algorithm, which does
not model the training data, but uses it directly each time of a prediction.
– It is an interesting framing of supervised learning where the classical problem of “approximating a
mapping function from data and using it to make a prediction” is seen as more difficult than is
required. Instead, specific predictions are made directly from the real samples from the domain. No
function approximation is required.

23
Types of Learning

▪ Supervised Learning: Training data includes desired outputs


– Given: training data + desired outputs (labels)
– Classification
– Regression

▪ Unsupervised Learning: Training data does not include desired outputs


– Given: training data (without desired outputs)
– Clustering
– Finding association (in features)
– Probability distribution estimation
– Dimension reduction

▪ Semi-supervised learning: Training data includes a few desired outputs


– Given: training data + a few desired outputs

▪ Reinforcement Learning: Rewards from sequence of actions


– Rewards from sequence of actions
– Decision making (robot, chess machine)

24
Types of Learning: Supervised Learning
Feature
▪ Given examples of a vectors

function (𝑋, 𝐹(𝑋)) Training


Examples
▪ Predict function 𝐹(𝑋) for Machine
Learning
new examples 𝑋 Algorithm
– Discrete 𝐹(𝑋): Classification Labels
▪ 𝐹(𝑋) is categorical
– Continuous 𝐹(𝑋): Regression
▪ 𝐹(𝑋) is real-valued
– 𝐹(𝑋) = 𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦(𝑋):
Predictive
Probability estimation New Model
Example
Labels
Examples
Feature
vector

25
Types of Learning: Supervised Learning: Classification

▪ 𝑥 can be multi-dimensional
▪ Each dimension corresponds to an attribute
▪ Example 1: Breast cancer
– Clump Thickness
– Uniformity of Cell Size
– Uniformity of Cell Shape

▪ Example 2: Credit scoring


– Differentiating between low-risk and high-risk customers from their income
and savings
– Discriminant: IF 𝑖𝑛𝑐𝑜𝑚𝑒 > 𝜃1 AND 𝑠𝑎𝑣𝑖𝑛𝑔𝑠 > 𝜃2
THEN low-risk
ELSE high-risk

26
Types of Learning: Supervised Learning: Classification
Classification: Applications
▪ Aka Pattern recognition Training examples of a person

▪ Face recognition: Pose, lighting, occlusion (glasses,


beard), make-up, hair style
▪ Character recognition: Different handwriting styles.
▪ Speech recognition: Temporal dependency. Test images
– Use of a dictionary or the syntax of the language.
– Sensor fusion: Combine multiple modalities; eg, visual (lip
image) and acoustic for speech

▪ Medical diagnosis: From symptoms to illnesses


▪ ... AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html

27
Types of Learning: Supervised Learning: Classification
Other Supervised Learning Settings

▪ Multi-Class Classification
– There are more than two classes to classify instances into.

▪ Multi-Label Classification
– Multi-label learning refers to the classification problem where each
example can be assigned to multiple class labels simultaneously.

▪ Semi-supervised classification
– Make use of labeled and unlabeled data

▪ One Class Classification


– Only instances from one label are given

28
Types of Learning: Supervised Learning: Regression

▪ Given (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 )


▪ Learn a function 𝑓(𝑥) to predict 𝑦 given 𝑥
– 𝑦 is real-valued == regression

▪ Find a relationship between a numeric dependent variable


and one or more independent variables
▪ Example: Price of a used car
– 𝑥 ∶ car attributes
y = wx+w0
– 𝑦 ∶ price
– 𝑦 = 𝑔 (𝑥 | 𝜃 )
– 𝑔 ( ) model,
– 𝜃 parameters

29
Types of Learning: Supervised Learning: Regression
Regression Applications
▪ Navigating a car: Angle of the steering wheel (CMU NavLab)
▪ Kinematics of a robot arm

(x,y) α1= g1(x,y)


α2= g2(x,y)
α2

α1

 Response surface design

30
Types of Learning: Supervised Learning: Uses

▪ Prediction of future cases: Use the


rule to predict the output for future
inputs
▪ Knowledge extraction: The rule is
easy to understand
▪ Compression: The rule is simpler than
the data it explains
▪ Outlier detection: Exceptions that are
not covered by the rule, e.g., fraud

31
Types of Learning: Supervised Learning: Multi-instance learning

▪ A supervised learning problem


where individual examples are
unlabeled; instead, bags or
groups of samples are labeled.

32
Types of Learning: Unsupervised Learning
Feature
vectors
▪ For about 40 years, unsupervised
learning was largely ignored by the Training
machine learning community Examples
– Some widely used definitions of machine Machine
learning actually excluded it. Learning
– Many researchers thought that clustering Algorithm
was the only form of unsupervised learning.

▪ It is hard to say what the aim of


unsupervised learning is.
– One major aim is to create an internal
representation of the input that is useful for
subsequent supervised or reinforcement Likelihood
learning. or
– You can compute the distance to a surface Model
Cluster Id
by using the disparity between two images. New or
But you don’t want to learn to compute Examples Better
Representation
disparities by stubbing your toe thousands of Feature
times. vector

33
Types of Learning: Unsupervised Learning: Clustering

▪ Given 𝑥1 , 𝑥2 , ..., 𝑥𝑛 (without labels)


– No expected output given
▪ Output hidden structure behind the 𝑥’s
– Clustering: is the assignment of a set of observations into
subsets (called clusters) so that observations in the same
cluster are similar in some sense.
– Anomaly Detection: Detecting patterns in a given data set
that do not conform to an established normal behavior.
▪ Learning “what normally happens”
▪ Example applications
– Customer segmentation in CRM
– Image compression: Color quantization
– Bioinformatics: Learning motifs

34
Types of Learning: Unsupervised Learning: Learning Associations

▪ Basket analysis:
– 𝑃(𝑌| 𝑋) probability that somebody who buys 𝑋 also buys 𝑌 where 𝑋 and 𝑌 are products/services.
– Example: 𝑃(𝑐ℎ𝑖𝑝𝑠|𝑏𝑒𝑒𝑟) = 0.7

35
Types of Learning: Semi-Supervised Learning

▪ Semi-supervised learning is supervised learning where the


training data contains very few labeled examples and a large
number of unlabeled examples.

▪ Self-supervised learning is a machine learning process


where the model trains itself to learn one part of the input
from another part of the input. It is also known as predictive
or pretext learning.

36
Types of Learning: Reinforcement Learning

▪ Given a sequence of states and actions with rewards, output a policy


– Rewards are: delayed, occasional, scalar
– Policy is a mapping from states to actions that tells you what to do in a given state
– The goal in selecting each action is to maximize the expected sum of the future rewards.
– We usually use a discount factor for delayed rewards so that we don’t have to look too far into the future.
– No supervised output

▪ Reinforcement learning is difficult:


– The rewards are typically delayed so its hard to know where we went wrong (or right).
– A scalar reward does not supply much information.

▪ Examples:
– Credit assignment problem
– Game playing
– Robot in a maze
– Balance a pole on your hand
– Multiple agents, partial observability, ...

37
Types of Learning: Reinforcement Learning
The Agent-Environment Interface

Agent and environment interact at discrete time steps : 𝑡 = 0, 1, 2, 𝐾


Agent observes state at step 𝑡: 𝑠𝑡 ∈ 𝑆
produces action at step 𝑡: 𝑎𝑡 ∈ 𝐴(𝑠𝑡 )
gets resulting reward: 𝑟𝑡+1 ∈ ℜ
and resulting next state: 𝑠𝑡+1

38
Learning Paradigms

▪ Multi-Task and Transfer Learning


▪ Active learning
▪ Ensemble learning
▪ Online learning and Incremental Learning
– Learns one instance at a time.

▪ Ranking and Preference Learning


▪ Sequence labeling
▪ Cost-sensitive Learning
▪ Concept Drift
▪ Collective classification
– When instances are dependent!

39
Learning Paradigms: Transfer learning

▪ A machine learning technique where a model


developed for a task/domain is reused as the
starting point for a model on another
task/domain

No labelled data in
source domain
Labelled data
Inductive transfer
available in the
learning
target domain
Labelled data in
source domain
Labelled data
Transductive
Transfer learning available only in
transfer learning
the source domain

No labelled data in
Unsupervised
both target and
transfer learning
source domain

40
Learning Paradigms: Active learning
▪ Active Learning
– Learner can query an oracle about class of an
unlabeled example in the environment.
– Learner can construct an arbitrary example and
query an oracle for its label.

▪ It is not Reinforcement Learning


– Where the learner can run directly in the
environment without any human guidance and
obtain feedback.

▪ It is not Passive Learning


– Where random examples are provided outside of
the learner’s control.
– Good training examples selected by a
“benevolent teacher.”
– “Near miss” examples

41
Learning Paradigms: Active learning

▪ Membership query synthesis


– It is a strategy of generating/ synthesizing queries based on some membership criteria.
– Model generates the query de novo (from the beginning)
– i.e — Model generates new queries instead of existing ones.

▪ Stream-based selective sampling


– It is a sequential strategy where we can randomly sample data points from a sampled distribution and then select if we want
to label it or not.
– Each query decision is made individually.
– The learner decides if they want to query that instance.
– There is some “informativeness measure” which helps the learner decide about query strategies.

▪ Pool-based sampling
– When there is small subset of labelled data in a large pool of unlabelled data, queries are selectively drawn from that pool
– Based on some “informativeness measure”, all the queries are first ranked and then the best one is taken into account.
– This is the main difference between stream-based and pool-based sampling
– i.e individual querying and group querying.

42
Learning Paradigms: Ensemble Learning

▪ The idea is to use multiple models to


obtain better predictive performance than
could be obtained from any of the
constituent models.
▪ Boosting involves incrementally building
an ensemble by training each new model
instance to emphasize the training
instances that previous models
misclassified.

43
Learning Paradigms: Ensemble methods
The Wisdom of Crowds
▪ Under certain controlled conditions, the
aggregation of information in groups,
resulting in decisions that are often
superior to those that can been made by
any single - even experts.
▪ Imitates our second nature to seek
several opinions before making any
crucial decision. We weigh the individual
opinions, and combine them to reach a
final decision

44
Learning Paradigms: Ensemble methods
Committees of Experts
▪ – “ … a medical school that has the objective that all
students, given a problem, come up with an identical
solution”
▪ There is not much point in setting
up a committee of experts from
such a group - such a committee
will not improve on the judgment of
an individual.
▪ Consider: There needs to be
disagreement for the committee to
have the potential to be better than
an individual.

45
Design

46
Designing a Learning System

Environment/
Experience
▪ Choose the training experience
▪ Choose exactly what is to be learned
– i.e. the target function
Training Testing
▪ Choose how to represent the target Data Data

function
▪ Choose a learning algorithm to infer the
target function from the experience
Performance
Learner Knowledge
Element

47
Training vs. Test Distribution

▪ We generally assume that the


training and test examples are
independently drawn from the same
overall distribution of data
▪ We call this “i.i.d” which stands for
“independent and identically
distributed”
▪ If examples are not independent,
requires collective classification
▪ If test distribution is different,
requires transfer learning

48
ML in a Nutshell

▪ Tens of thousands of machine learning algorithms


– Hundreds new every year

▪ There are several factors affecting the performance:


– Types of training provided
– The form and extent of any initial background knowledge
– The type of feedback provided
– The learning algorithms used

▪ Every machine learning algorithm has three components:


– Representation
– Evaluation
– Optimization

49
ML in a Nutshell: Representations

▪ Numerical functions ▪ Instance-based functions


– Linear regression – Nearest-neighbour
– Neural networks – Case-based
– Support vector machines
▪ Probabilistic Graphical Models
▪ Symbolic functions – Naïve Bayes
– Decision trees – Bayesian networks
– Rules in propositional logic – Hidden-Markov Models (HMMs)
– Rules in first-order predicate logic / Sets of – Probabilistic Context Free Grammars (PCFGs)
rules / Logic programs
– Markov networks

▪ Model ensembles

50
ML in a Nutshell: Evaluation

▪ Confusion Matrix Predicted


▪ Accuracy Negative Positive
▪ Precision and Recall True Negative False Positive
Negative
▪ Squared error (TN) (FP)
Actual
False Negative True Positive
▪ Likelihood Positive
(FN) (TP)
▪ Posterior probability
▪ Precision p is the number of correctly classified positive examples divided by the
▪ Cost / Utility total number of examples that are classified as positive.
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑇𝑃
▪ Margin ▪ P = Precision =
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 +𝐹𝑎𝑙𝑠𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒
=
𝑇𝑃 +𝐹𝑃

▪ Entropy ▪ Recall r is the number of correctly classified positive examples divided by the
total number of actual positive examples in the test set.
▪ K-L divergence
𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑇𝑃
▪ R = Recall = =
▪ Etc. 𝑇𝑟𝑢𝑒 𝑃𝑜𝑠𝑖𝑡𝑖𝑣𝑒 +𝐹𝑎𝑙𝑠𝑒 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝑇𝑃 +𝐹𝑁

51
ML in a Nutshell: Various Search/Optimization Algorithms

▪ The success of machine learning system ▪ Gradient descent


also depends on the algorithms. – Perceptron
▪ The algorithms control the search to find – Backpropagation
and build the knowledge structures. ▪ Dynamic Programming
▪ The learning algorithms should extract – HMM Learning
useful information from training examples. – PCFG Learning
▪ Combinatorial optimization ▪ Divide and Conquer
– E.g.: Greedy search – Decision tree induction
– Rule learning
▪ Convex optimization
– E.g.: Gradient descent ▪ Evolutionary Computation
– Genetic Algorithms (GAs)
▪ Constrained optimization – Genetic Programming (GP)
– E.g.: Linear programming – Neuro-evolution

52
ML in Practice

1. Understanding domain, prior knowledge, and goals

2. Data integration, selection, cleaning, pre-processing, etc.

3. Learning models 6. Loop

4. Interpreting results

5. Consolidating and deploying discovered knowledge

53
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

54
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

▪ When a new email is sent


– We first place the new email in the space

55
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

▪ When a new email is sent


– We first place the new email in the space
– Classify it according to the subspace in which it resides
Email Length New Recipients

56
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

▪ When a new email is sent


– We first place the new email in the space
– Classify it according to the subspace in which it resides
Email Length New Recipients

▪ Any of these would be fine … but which is best?

57
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

▪ When a new email is sent


– We first place the new email in the space
– Classify it according to the subspace in which it resides Email
Length New Recipients

▪ Any of these would be fine … but which is best?


▪ Classifier Margin: Define the margin of a linear classifier
as the width that the boundary could be increased by
before hitting a data-point.

58
Linear Classifiers

▪ How would you classify this data?


– Ie, draw the decision boundry

▪ When a new email is sent


– We first place the new email in the space
– Classify it according to the subspace in which it resides Email Length
New Recipients

▪ Any of these would be fine … but which is best?

▪ Classifier Margin: Define the margin of a linear classifier as


the width that the boundary could be increased by before
hitting a data-point.
– The maximum margin linear classifier is the linear classifier with the,
maximum margin.
– This is the simplest kind of SVM (Called an LSVM)

59
No Linear Classifier can cover all instances

▪ How would you classify this data?

60
No Linear Classifier can cover all instances

▪ How would you classify this data?


▪ Ideally, the best decision boundary should be
the one which provides an optimal performance
such as shown in this figure
▪ However, our satisfaction is premature because
the central aim of designing a classifier is to
correctly classify novel input
▪ We have created an issue of generalization!

61
Decision tree

▪ A flow-chart-like tree structure


▪ Internal node denotes a test on an attribute
▪ Branch represents an outcome of the test
▪ Leaf nodes represent class labels or class distribution
▪ Decision trees divide the feature space into axis-parallel
rectangles and label each rectangle with one of the 𝐾
classes.

62
Top Down Induction of Decision Trees

▪ A single level decision tree is also known as Decision Stump

63
Top Down Induction of Decision Trees

64
Top Down Induction of Decision Trees

65
Top Down Induction of Decision Trees
Which One?

66
Issues and Solutions

67
Overfitting and Underfitting

▪ A: Generalization error
▪ B: Training Error
▪ C: Stopping Point
▪ D: Underfitting: when model is too simple,
both training and test errors are large
▪ E: Overtraining: means that it learns the
training set too well – it overfits to the
training set such that it performs poorly on
the test set.

68
Occam's razor (14th-century)

▪ In Latin “lex parsimoniae”, translating to


“law of parsimony”
▪ The explanation of any phenomenon
should make as few assumptions as
possible, eliminating those that make
no difference in the observable
predictions of the explanatory
hypothesis or theory

▪ The Occam Dilemma: Sadly, in ML, accuracy and simplicity (interpretability) are in
conflict.

69
No Free Lunch Theorem in Machine Learning
▪ “For any two learning algorithms, there are just as many
situations (appropriately weighted) in which algorithm
one is superior to algorithm two as vice versa,
according to any of the measures of "superiority“
▪ Hume (1739–1740) pointed out that ‘even after the
observation of the frequent or constant conjunction of
objects, we have no reason to draw any inference
concerning any object beyond those of which we have
had experience’.
▪ More recently, and with increasing rigour, Mitchell
(1980), Schaffer (1994) and Wolpert (1996) showed that
bias-free learning is futile.
▪ Wolpert (1996) shows that in a noise-free scenario
where the loss function is the misclassification rate, if
one is interested in off-training-set error, then there are
no a priori distinctions between learning algorithms.

70
No Free Lunch Theorem in Machine Learning
▪ More formally, where
– d = training set;
– m = number of elements in training set;
– f = ‘target’ input-output relationships;
– h = hypothesis (the algorithm's guess for f made in response to d); and
– C = off-training-set ‘loss’ associated with f and h (‘generalization error’)

▪ All algorithms are equivalent, on average, by any of the following measures of risk:
𝐸(𝐶|𝑑), 𝐸(𝐶|𝑚), 𝐸(𝐶|𝑓, 𝑑), or 𝐸(𝐶|𝑓, 𝑚).
– How well you do is determined by how ‘aligned’ your learning algorithm 𝑃(ℎ|𝑑) is with the actual
posterior, 𝑃(𝑓|𝑑).

▪ Wolpert's result, in essence, formalizes Hume, extends him and calls the whole of
science into question.

71
No Free Lunch Theorem in Machine Learning
So why Develop New Algorithms?
▪ Practitioner are mostly concerned with choosing the most appropriate algorithm for
the problem at hand
▪ This requires some a priori knowledge – data distribution, prior probabilities,
complexity of the problem, the physics of the underlying phenomenon, etc.
▪ The No Free Lunch theorem tells us that – unless we have some a priori
knowledge – simple classifiers (or complex ones for that matter) are not necessarily
better than others. However, given some a priori information, certain classifiers may
better MATCH the characteristics of certain type of problems.
▪ The main challenge of the practitioner is then, to identify the correct match between
the problem and the classifier! …which is yet another reason to arm yourself with a
diverse set of learner arsenal !

72
Less is More The Curse of Dimensionality (Bellman, 1961)

▪ Learning from a high-dimensional feature space requires an enormous


amount of training to ensure that there are several samples with each
combination of values.
▪ With a fixed number of training instances, the predictive power reduces
as the dimensionality increases.
▪ As a counter-measure, many dimensionality reduction techniques have
been proposed, and it has been shown that when done properly, the
properties or structures of the objects can be well preserved even in the
lower dimensions.
▪ Nevertheless, naively applying dimensionality reduction can lead to
pathological results.

73
The Problem with Dimensionality Reduction

▪ While dimensionality reduction is an


important tool in machine
learning/data mining, we must always
be aware that it can distort the data in
misleading ways.
▪ Plato's Cave still haunts us

74
The Problem with Dimensionality Reduction

▪ While dimensionality reduction is an


important tool in machine
learning/data mining, we must always
be aware that it can distort the data in
misleading ways.
▪ Plato's Cave still haunts us
▪ On the right is a two dimensional
projection of an intrinsically three
dimensional world….

75
The Problem with Dimensionality Reduction

▪ While dimensionality reduction is an


important tool in machine
learning/data mining, we must always
be aware that it can distort the data in
misleading ways.
▪ Plato's Cave still haunts us
▪ On the right is a two dimensional
projection of an intrinsically three
dimensional world….
▪ … which have given us a wrong view
of the world.

76
The Problem with Dimensionality Reduction

Watch the two videos from Jessica Lin

77
Instability and the Rashomon Effect

▪ Rashomon is a Japanese movie in which four people, from different


vantage points, witness a criminal incident. When they come to testify in
court, they all report the same facts, but their stories of what happened
are very different.
▪ The Rashomon effect is the effect of the subjectivity of perception on
recollection.
▪ The Rashomon Effect in ML is that there is often a multitude of
classifiers of giving about the same minimum error rate.
▪ For example in decision trees, if the training set is perturbed only
slightly, I can get a tree quite different from the original but with almost
the same test set error.

78
Summery
Task Data Learning Structure Learning Algorithm
From AI perspective Supervision
•Perception
Complexity of Data Functions •Supervised learning
•Induction •Binary, category, •Unsupervised learning
•Generation continuous, scale, vector, •Semi-supervised learning
graph, natural object Networks •Reinforcement learning

From technical perspective


•Dependent or (NN, graph)
independent Model
•Predictive •Complete or incomplete Logic programs •Probabilistic model
•Regression •Dynamics (GMM,HMM,pLDA,...)
•Classification and rule sets •Neural model (MLP,LSTM,...)
•Distance-based model (kNN,
•Descriptive metric learning, SVM)
•Clustering Finite-state •Information-based model (ME,
•Density estimation machines CART,...)
•Other criteria (LDA,...)
Data representation
•Feature extraction Grammars
Objective function Learning approach
•Dimension reduction
•xEnt,MSE,Fisher score,sparsity, •Direct solution
•Data selection
information •Numerical optimization
•Task-dependent (MPE in ASR, Problem solving •Evolution
e.g.) systems

79
References
▪ Introduction to Machine Learning by Ethem Alpaydın
▪ Machine Learning by Pedro Domingos
▪ Overview of Machine Learning by Ἀπφία Ελευθερόπουλος
▪ Introduction to Machine Learning by Eric Eaton
▪ Machine Learning by Rajat Sharma
▪ Introduction to Machine Learning by Lior Rokach
▪ Introduction to Reinforcement Learning for Beginners by Prathima Kadari
▪ Statistics in Marketing - Discrete Probability Distributions by Bernard Yeo
▪ Examples of Occam's Razor: Principle Simply Explained by Jennifer Gunner
▪ No Free Lunch Theorems
▪ A Treatise of Human Nature: Being an Attempt to Introduce the Experimental Method of Reasoning into Moral Subjects by David Hume
▪ The Danger of Dimensionality Reduction by Jessica Lin
▪ How was the voting in ancient Greece by greecehighdefinition
▪ No, Machine Learning is not just glorified Statistics by Joe Davison
▪ Chapter 1 - Introduction by National Cheng Kung University – NCKU
▪ 8 Inspirational Applications of Deep Learning by Jason Brownlee
▪ The world of loss function by 홍배 김 (Hongbae Kim)
▪ Introduction to Semi-Supervised Learning by TEKSANDS
▪ Gentle Introduction to Transduction in Machine Learning by Jason Brownlee
▪ Anomaly Detection and Complex Event Processing over IoT Data Streams by and Fatos Xhafa
▪ Let Your Model Select Which Data to Learn From — Basics of Active Learning by Parthvi Shah

80

You might also like