Unit - 1 - Introduction of the machine learning

Machine Learning
Course Code: CS4013
Course Teacher: Mr. Taranpreet Singh

Class: B. Tech CE, Semester VII
Course Code: CS4013    Course Name: Machine Learning
L T P Credits: 3 - - 3
Course Description:
This course provides a concise introduction to the fundamental concepts in machine
learning and popular Machine Learning (ML) algorithms. We will cover the standard and
most popular supervised learning algorithms including linear regression, logistic
regression, decision trees, k-nearest neighbor, an introduction to Bayesian learning and
the naïve Bayes algorithm, support vector machines and kernels, and neural networks with
an introduction to Deep Learning. In the course we will discuss various issues related to
the application of machine learning algorithms.

Course Learning Outcomes:
After successful completion of the course, students will be able to:
1. Identify the challenges and opportunities in solving real-world problems using a Machine
Learning approach.
2. Select appropriate evaluation metrics for checking ML algorithm accuracy.
3. Construct graphical presentations of data by applying ML algorithms.
4. Apply clustering algorithms to group unlabeled data.
5. Choose an appropriate ML or neural network algorithm for prediction of the class label or
value of a target attribute.
Prerequisites:
• Basic knowledge of probability theory
• Basic knowledge of Python programming

Course Content
Unit 1: Introduction to Machine Learning (06 hrs)
Introduction: Basic definitions, types of learning: Supervised, Unsupervised and Reinforcement Learning,
hypothesis space and inductive bias, evaluation, cross-validation, Overfitting & Underfitting.

Unit 2: Probability and Statistics (05 hrs)
Data Exploration and Pre-processing: Data Objects and Attributes; Statistical Measures, Visualization, Data
Cleaning, and Integration. Dimensionality Reduction: Linear Discriminant Analysis; Principal Component
Analysis; Transform Domain and Statistical Feature Extraction and Reduction.

Unit 3: Regression and Classification Algorithms (07 hrs)
Linear regression, Multiple linear regression, Decision Tree Induction including Attribute Selection and Tree
Pruning, Random Forests, Logistic Regression, Support Vector Machine.

Unit 4: Clustering and Ensemble Learning (06 hrs)
K-means clustering, KNN, Hierarchical clustering, Density Based Clustering, Cluster validation, Clustering
application. Ensemble Learning: Bagging, boosting, Adaboost.

Unit 5: Recommendation System (06 hrs)
Machine Learning based Recommendation System, Top recommendation Systems on the Internet, Approaches to
recommendation system design: Collaborative Filtering, Content based Filtering and hybrid approach, Natural
Language Processing.

Unit 6: Artificial Neural Network (06 hrs)
Introduction to neural networks, Activation functions, learning rate, Stochastic gradient descent, feed
forward, back propagation, basics of deep learning

Some common Terms
• Artificial Intelligence (AI)
– Any method that tries to
replicate the results of some
aspect of human cognition
• Machine Learning (ML)
– Programs that perform better
with experience
• Deep Learning
– A subset of Machine
Learning
– Artificial Neural Networks (ANN)
– Convolutional Neural Networks (CNN)
Deep Learning, I. Goodfellow
et al. (2016)

Introduction to Machine learning
Introduction:
Basic definitions.
Types of learning:
Supervised,
Unsupervised
Reinforcement Learning
Hypothesis space
Inductive bias
Evaluation
cross-validation
Overfitting & Underfitting.

Machine Learning is…
Machine learning, a branch of artificial intelligence,
concerns the construction and study of systems that
can learn from data.
Machine learning is an application of
artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without
being explicitly programmed.
Machine learning is the science of getting computers to
act without being explicitly programmed.

Machine Learning Paradigm
• Classical Programming: Rules + Data → Answers
• Machine Learning: Data + Answers → Rules

When is Machine Learning
useful?
• When experts are unable to explain their
expertise
– Image recognition
– Speech Recognition
– Driving a car
• When human expertise does not exist
– Hazardous environment – navigation on Mars
• When the solution needs to be adapted to particular
cases
– User biometrics
– Patient-specific treatment

Machine Learning History
• 1950s
– Samuel’s checker-playing program
• 1960s
– Neural network: Rosenblatt’s
perceptron – delta rule
– Pattern recognition
– Minsky and Papert proved
limitations of perceptron
• 1970s
– Induction of symbolic concepts
– Expert systems
– Natural Language Processing

Machine Learning History
• 1980s
– Advanced Decision Trees and Rule Learning
– Resurgence of neural network – MLP – back propagation
algorithm
• 1990s
– Support Vector Machines (SVM)
– Data Mining
– Adaptive agents and web applications
– Text learning
– Reinforcement learning
– Ensembles - Adaboost
– Bayes Network
1994: First self-driving car made a road test
1997: Deep Blue beat the world chess champion Garry Kasparov
2009: Google built a self-driving car
2011: IBM Watson won Jeopardy
2014: Human vision surpassed by ML systems

Why is ML gaining popularity
in recent times?
• New Software and algorithms
– Neural Networks
– Deep Learning
• New Hardware – High Performance Computers
– GPUs – massive computational power for
computing and online learning
• Cloud Enabled Systems
• Availability of Big Data

Definition
• Arthur Samuel (1959):
– Field of study that gives computers the ability to
learn without being explicitly programmed
• Tom Mitchell (1998):
– A computer program is said to learn from experience
“E” with respect to some class of tasks “T” and
performance measure “P”, if its performance at tasks
in “T”, as measured by “P”, improves with experience “E”.

Examples
• T: Playing checkers
• P: Percentage of games won against an arbitrary
opponent
• E: Playing practice games against itself
• T: Recognizing hand-written words
• P: Percentage of words correctly classified
• E: Database of human-labeled images of handwritten
words
• T: Categorize email messages as spam or non-spam
• P: Percentage of email messages correctly classified

Definition
Hal Daume III:
Machine learning is about predicting the future
based on the past.
• Training data (past) → learn → model/predictor
• Testing data (future) → model/predictor → predict

Domains and Applications
• Computer Vision
– Say what objects appear in an image
– Convert hand-written digits to characters 0…9
– Detect where objects appear in an image

• Robot Control
– Design autonomous mobile robots that
learn to navigate from their own experience

• NLP
– Sentiment analysis: detect if a
product/movie review is positive, negative,
or neutral
– Speech recognition
– Machine translation

• Business Intelligence
– Forecasting product sales quantities taking
seasonality and trend into account
– Optimizing product location at a supermarket
retail outlet

Types of Learning
• Supervised Learning
– X, y (Pre-classified training data)
– Given an observation x, find best label y
• Unsupervised Learning
– X
– Given a set of x’s, cluster them
• Reinforcement Learning
– Determine what to do based on rewards
and punishments.

Supervised Learning
Training data:
X          y
Input-1    Output-1
Input-2    Output-2
Input-3    Output-3
...        ...
Input-n    Output-n
• Training data (X, y) → Learning Algorithm → Model
• New input x → Model → Output y

Supervised Learning
• Regression
• Classification
34

• A model is defined with a set of
parameters: $y = h(x \mid \theta)$
• where $h(\cdot)$ is the model and $\theta$ are its
parameters
• Regression: $y$ is a number, and $h$ is the regression function
• Classification: $y$ is a class label, and $h$ is the discriminant function
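A minimal sketch of such a parametric model, assuming a one-dimensional linear form h(x | θ) = w0 + w1·x fitted by ordinary least squares. The function names (`fit_line`, `h`, `classify`), the data and the threshold are illustrative and do not come from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimate of theta = (w0, w1) for h(x | theta) = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def h(x, theta):
    """Regression use of the model: the output y is a number."""
    w0, w1 = theta
    return w0 + w1 * x


def classify(x, theta, threshold):
    """Discriminant use of the same model: the output y is a class label."""
    return "+" if h(x, theta) >= threshold else "-"


# Toy training data (assumed for illustration): y = 1 + 2x exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta = fit_line(xs, ys)
prediction = h(4.0, theta)
label = classify(4.0, theta, threshold=5.0)
print(theta, prediction, label)
```

The same fitted parameters serve both roles: `h` returns a number (regression), while `classify` thresholds that number to obtain a label (classification).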

Methods under Supervised Learning
• Regression
– Linear Regression
• Classification
– Logistic Regression (a classification method, despite its name)
– Decision Tree
– Random Forest
– KNN
– SVM

Unsupervised Learning
Training data:
X
Input-1
Input-2
Input-3
...
Input-n
• Training data X (no labels) → Learning Algorithm → Clusters
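As one possible instance of "inputs in, clusters out", the sketch below runs a basic k-means loop on unlabeled 2-D points. K-means is only one clustering method (it appears in Unit 4 of the syllabus). The points, the starting centroids and the function names are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = list(centroids)
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


# Unlabeled inputs: only X, no y.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes entirely from distances between the inputs.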

Reinforcement Learning
• The problem is as follows: We have an agent and a reward,
with many hurdles in between. The agent is supposed to find
the best possible path to reach the reward.
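The agent-and-reward setting can be sketched with tabular Q-learning, which is one common reinforcement learning algorithm and not necessarily the one the course uses. The environment here is a hypothetical five-cell corridor with the reward in the last cell and no hurdles. The learning rate, discount factor and exploration rate are arbitrary illustrative choices.

```python
import random

N_STATES = 5          # cells 0..4; the reward sits in cell 4
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right


def step(state, action):
    """Move within the corridor; reaching the last cell gives reward 1 and ends the episode."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                      # explore
            else:
                best = max(q[state])                      # exploit, random tie-break
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


q_table = q_learning()
policy = ["left" if row[0] > row[1] else "right" for row in q_table[:-1]]
print(policy)
```

The agent is never told the path. It only receives the reward signal and gradually prefers actions that lead towards it.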

Inductive Learning
• Inductive learning or “Prediction”:
– Given examples of a function (X, F(X))
– Predict function F(X) for new examples X
• Classification: F(X) = Discrete
• Regression: F(X) = Continuous
• Probability estimation: F(X) = Probability

Terminology – Features
• Instances are described in terms of
features
• Features
– properties that describe each instance

Terminology – Feature Space
Feature Space:
Properties that describe the
problem
Credit: Jesse Davis, University of Washington
[Figure: two-dimensional feature space; horizontal axis from 0.0 to 5.0, vertical axis from 0.0 to 3.0]

Terminology
Sample / Example:
<0.5,2.8,+>
[Figure: positive (+) and negative (-) examples plotted in the feature space; horizontal axis from 0.0 to 5.0, vertical axis from 0.0 to 3.0]
Credit: Jesse Davis, University of Washington

Terminology - Hypothesis
Hypothesis:
Function for labeling examples
[Figure: the same + and - examples, with the feature space divided into a "Label: +" region and a "Label: -" region; unlabeled points are marked "?"]

Terminology – Hypothesis Space
Hypothesis Space:
Set of legal hypotheses
The space of all hypotheses
that can, in principle, be
output by a learning
algorithm.
[Figure: the same + and - examples plotted in the feature space]

Hypothesis Space
• A function is represented in terms of
features
• Decide the features
• Define the class of function / language of
the function
→ Hypothesis Space

Representations
• Linear Function
• Decision Tree

Representations
• Multivariate Linear Function

Representations
• Single Layer Perceptron
• Multi-layer Neural Network

Terminology
• Training Example <x, y>: Instance x with label y = f(x)
• Training Data S: Collection of examples observed by
the learning algorithm.
• Feature Space / Instance Space X: Set of all possible
objects that can be described by features.
• Concept c: Subset of objects from X (c is unknown).
• Target Function f: Maps each instance $x \in X$ to a target
label $y \in Y$, i.e. $f : X \to Y$

Classifier
• Hypothesis h: function that approximates f
• Hypothesis Space H: set of functions we
allow for approximating f
– The set of hypotheses that can be
produced can be restricted further by
specifying a bias
• Input: training set $S \subseteq X$
• Output: hypothesis $h \in H$

Size of hypothesis space
• Assume Boolean features
• If there are 4 input features (Boolean),
possible instances = $2^4 = 16$
• Number of Boolean functions possible =
number of possible subsets of 16 instances =
$2^{16} = 2^{2^4}$
• In general, if there are N input Boolean features, then
number of possible instances = $2^N$ and number
of possible functions = $2^{2^N}$
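The counts above can be checked directly. The sketch below computes both quantities and, for a small N, enumerates every Boolean function as an explicit truth table. The function names are illustrative.

```python
from itertools import product


def count_instances(n_features):
    """Number of distinct instances with n Boolean features: 2^N."""
    return 2 ** n_features


def count_boolean_functions(n_features):
    """Each instance can be labeled True or False independently: 2^(2^N)."""
    return 2 ** count_instances(n_features)


def enumerate_boolean_functions(n_features):
    """List every Boolean function as a truth table (only feasible for tiny N)."""
    instances = list(product([False, True], repeat=n_features))
    return [dict(zip(instances, outputs))
            for outputs in product([False, True], repeat=len(instances))]


print(count_instances(4))                    # instances for 4 Boolean features
print(count_boolean_functions(4))            # Boolean functions over them
print(len(enumerate_boolean_functions(2)))   # explicit enumeration for N = 2
```

The double exponential growth is why an unrestricted hypothesis space quickly becomes impractical to search, which motivates the inductive bias discussed next.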

Inductive Bias
• Choosing a hypothesis space requires making
assumptions
– Experience alone doesn’t allow us to draw
conclusions about unseen data instances
• Two types of bias:
– Restriction: Limit the hypothesis space
(e.g., look at rules)
– Preference: Impose ordering on hypothesis
space (e.g., more general, consistent with data)

Inductive Learning
• Inducing a general function from training
examples
– Construct a hypothesis h that agrees with c on the
training examples
– A hypothesis is consistent if it agrees with all
training examples
– A hypothesis is said to generalize well if it correctly
predicts the value of y for new, unseen examples
• Inductive learning is an ill-posed problem: unless we
see all the possible examples, the data is not sufficient
for an inductive learning algorithm to find a unique
solution

Learning as refining the
hypothesis space
• Concept learning is the task of searching a
hypothesis space of possible representations,
looking for the representation(s) that best fit
the data, given the bias
• The tendency to prefer one hypothesis over
another is called a bias
• Occam’s Razor: prefer the simplest hypothesis consistent with the data

Some more Types of
Inductive Bias
• Minimum description length
• Maximum Margin

Evaluation and Cross
Validation

Performance Evaluation of
Learning Algorithms
• A few performance measures
(experimental evaluation)
– Error
– Accuracy
– Precision / Recall
• Typical sampling methods
– Train / Test datasets
– K-fold cross validation

Computation of Error
• Useful for regression problems
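The slide's own error formula did not survive extraction, so the following is only a sketch of three error measures commonly used for regression: mean absolute error, mean squared error and root mean squared error. They may or may not match what the slide showed. The data values are made up for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Illustrative targets and predictions (assumed values).
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)
```

Squared error penalizes large deviations more heavily than absolute error, which is one reason the two can rank models differently.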

Criterion function to assess
classifier performance
• Accuracy, error rate
• Other characteristics derived from the
confusion matrix

Criterion function to assess
classifier performance
• Accuracy, error rate
– Accuracy is the percent of correct
classifications.
– Error rate is the percent of incorrect
classifications.
– Accuracy = 1 - Error rate.

Confusion matrix, two classes only
TP: Correct acceptances
FP: False alarms
TN: Correct rejections
FN: Misses
Performance measures calculated from the
confusion matrix:
• Accuracy = (TN + TP)/total
= (TN + TP)/(TN+TP+FN+FP)
– Overall, how often is the classifier correct?
• True positive rate, recall, sensitivity =
TP/actual positive = TP / (TP+FN)
– When it's actually yes, how often does it predict yes?
• True negative rate, specificity =
TN/actual negative = TN / (TN+FP)
– When it's actually no, how often does it predict no?
• Precision, positive predictive value =
TP/predicted positive = TP/(TP+FP)
– When it predicts yes, how often is it correct?
• False positive rate, false alarm =
FP/actual negative = FP/(TN+FP)
= 1 – specificity
– When it's actually no, how often does it predict yes?
• False negative rate =
FN/actual positive = FN/(TP+FN)
– When it's actually yes, how often does it predict no?
• Error rate = (FP + FN)/total = 1 – accuracy
– Overall, how often is it wrong?

Confusion matrix : Example
Two classes: 1. Sample has disease (Positive)
2. No disease (Negative)
TP: We predicted yes (they have the
disease), and they do have the disease
FP: We predicted yes, but they do not
actually have the disease
TN: We predicted no, and they do not
have the disease
FN: We predicted no, but they actually
do have the disease
Performance measures calculated from
the confusion matrix (TP = 100, TN = 50, FP = 10, FN = 5, total = 165):
Accuracy = (TN + TP)/total =
(100+50)/165 = 0.91
True positive rate, recall, sensitivity =
TP/actual positive = 100 / 105 = 0.95
True negative rate, specificity =
TN/actual negative = 50 / 60 = 0.83
Precision, positive predictive value =
TP/predicted positive = 100 / 110 = 0.91
False positive rate, false alarm =
FP/actual negative = 10 / 60 = 0.17
False negative rate = FN/actual positive
= 5 / 105 = 0.05
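The measures in this example can be recomputed from the four counts. The counts FP = 10 and FN = 5 are inferred from the slide's totals (60 actual negatives, 105 actual positives). The function name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Two-class performance measures derived from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),               # true positive rate, sensitivity
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
        "false_negative_rate": fn / (tp + fn),
    }


# Counts from the disease example: 165 samples in total.
metrics = confusion_metrics(tp=100, fp=10, tn=50, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Working from the raw counts makes it easy to confirm identities such as false positive rate = 1 - specificity and error rate = 1 - accuracy.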

Confusion matrix, # of classes > 2
• Example : Predicting true class labels of optical character recognition
for numerals 0-9. There were 100 examples of each number class
available for the evaluation. Empirical performance is given in
percentage.
• The classifier allows the reject option, class label R.

Training vs. test data
• Problem: Finite data are available and
have to be used both for training and
testing
– More training data gives better
generalization.
– More test data gives better estimate for the
classification error probability.
02/06/2025

Training vs. test data
• Partitioning of available finite set of data to
training / test sets.
– Hold out
– Cross validation
– Bootstrap

Hold out method
• The given data is randomly partitioned into two
independent sets.
– Training set (e.g., 2/3 of data) for the statistical model
construction, i.e. learning the classifier.
– Test set (e.g., 1/3 of data) is held out for the accuracy
estimation of the classifier.
• Random subsampling is a variation of the hold out method:
– Repeat the hold out k times; the accuracy is estimated as the
average of the k accuracies
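A minimal sketch of the hold out method and its repeated variant, using only the standard library. The majority-class baseline stands in for a real classifier and is purely an assumption. The `evaluate` callback, function names and data are likewise illustrative.

```python
import random


def holdout_split(data, test_fraction=1 / 3, rng=None):
    """Randomly partition data into a training set and a held-out test set."""
    rng = rng or random.Random()
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_baseline_accuracy(train, test):
    """Placeholder model: always predict the most frequent training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)


def random_subsampling(data, evaluate, k, seed=0):
    """Repeat the hold out k times and average the k accuracies."""
    rng = random.Random(seed)
    scores = [evaluate(*holdout_split(data, rng=rng)) for _ in range(k)]
    return sum(scores) / k


# Toy labeled data (assumed): 10 positive and 20 negative examples.
data = [(i, "+" if i % 3 == 0 else "-") for i in range(30)]

train, test = holdout_split(data, rng=random.Random(0))
mean_accuracy = random_subsampling(data, majority_baseline_accuracy, k=5)
print(len(train), len(test), mean_accuracy)
```

Averaging over several random partitions reduces the dependence of the estimate on one particular split.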

K-fold cross validation
• The training set is randomly
divided into K disjoint sets of
equal size where each part has
roughly the same class
distribution (stratified sampling).
• The classifier is trained K times,
each time with a different set
held out as a test set.
• The estimated error is the
mean of these K errors.
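One simple way to obtain folds with roughly the same class distribution is to shuffle the indices of each class separately and deal them out round-robin. This is a sketch of that idea under assumed names (`stratified_kfold`, `train_test_pairs`), not a reference implementation.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Return k disjoint lists of indices with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal out round-robin
    return folds


def train_test_pairs(folds):
    """Yield (train_indices, test_indices), holding out each fold once."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels (assumed): 6 positive and 9 negative examples.
labels = ["+"] * 6 + ["-"] * 9
folds = stratified_kfold(labels, k=3)
pairs = list(train_test_pairs(folds))

for train, test in pairs:
    print(len(train), sorted(test))
```

A classifier would be trained on each `train` index list and scored on the matching `test` list. The K scores are then averaged.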

Leave-one-out
• A special case of K-fold cross validation with K
= n, where n is the total number of samples in
the training multiset.
• n experiments are performed using n − 1
samples for training and the remaining
sample for testing.
• Computationally expensive.
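Leave-one-out can be sketched in a few lines when the model is cheap to train. Here a 1-nearest-neighbour rule on one-dimensional inputs stands in for the classifier. The data points and function names are assumptions made for illustration.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training sample closest to x (1-NN)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def leave_one_out_accuracy(samples):
    """Run n experiments: train on n - 1 samples, test on the one left out."""
    correct = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct += nearest_neighbor_predict(train, x) == y
    return correct / len(samples)


# Toy data (assumed): the point at 2.3 lies closer to the "+" group.
samples = [(1.0, "-"), (1.2, "-"), (1.4, "-"), (2.3, "-"),
           (3.0, "+"), (3.2, "+"), (3.4, "+")]

loo_accuracy = leave_one_out_accuracy(samples)
print(loo_accuracy)
```

With n samples the model is rebuilt n times, which is where the computational cost comes from for models that are expensive to train.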

Bootstrap aggregating (bagging)
• Given: training set T with n entries.
• Bootstrap generates k new
datasets T_i, each of size n’ <= n, by
sampling T with replacement, so
some entries can be repeated in T_i.
• The remaining entries that were
not selected for training are used
for testing; their number is likely to
change from fold to fold.
• The k statistical models (e.g.,
classifiers, regressors) are learned
using the above k bootstrap
samples.
• Sampling with replacement to form the training set:
- Improves stability and
accuracy of ML algorithms.
- Reduces variance
- Helps to avoid overfitting.
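The sampling step of the bootstrap can be sketched as follows, assuming n' = n for simplicity. Only the index sampling is shown. Training and combining the k models is left out, and the names are illustrative.

```python
import random


def bootstrap_sample(n, rng):
    """Draw n indices with replacement; the unselected indices are left for testing."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)
n, k = 10, 3
bootstrap_sets = [bootstrap_sample(n, rng) for _ in range(k)]

for in_bag, out_of_bag in bootstrap_sets:
    # Repeated indices appear in in_bag; the size of out_of_bag varies per sample.
    print(sorted(in_bag), out_of_bag)
```

Each of the k models would be trained on the entries indexed by its own `in_bag` list and tested on its `out_of_bag` entries.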

Three-way data splits
• If model selection and true error estimates are to be
computed simultaneously, the data needs to be divided into
three disjoint sets [Ripley, 1996]
– Training set: a set of examples used for learning: to fit the parameters of the
classifier
– Validation set: a set of examples used to tune the parameters of a classifier
– Test set: a set of examples used only to assess the performance of a fully-
trained classifier
• Why separate test and validation sets?
– The error rate estimate of the final model on validation data will be biased
(smaller than the true error rate) since the validation set is used to select the
final model
– After assessing the final model on the test set, YOU MUST NOT tune the model
any further!
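The three roles can be illustrated with a small sketch. The training set holds the examples a k-nearest-neighbour rule memorizes, the validation set picks k, and the test set is used once at the end. The 60/20/20 proportions, the candidate values of k and the toy data are all assumptions.

```python
import random


def three_way_split(data, fractions=(0.6, 0.2), seed=0):
    """Shuffle and split into disjoint training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_val = round(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train, x, k):
    """Majority vote among the k training examples nearest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    positives = sum(1 for _, label in neighbors if label)
    return positives * 2 > k


def accuracy(train, examples, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in examples)
    return hits / len(examples)


# Toy data (assumed): the label is True when x >= 2.5.
data = [(i / 10, i >= 25) for i in range(50)]
train, val, test = three_way_split(data)

candidates = (1, 3, 5)
best_k = max(candidates, key=lambda k: accuracy(train, val, k))  # model selection
test_accuracy = accuracy(train, test, best_k)                    # assessed once
print(best_k, test_accuracy)
```

The validation accuracy of `best_k` is optimistic because it was used to choose `best_k`. Only `test_accuracy` is an unbiased estimate, provided no further tuning follows.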

Issues in Machine Learning
• What are good hypothesis spaces?
• Which algorithms work with these hypothesis spaces?
• How to optimize accuracy on future data points?
• How can we have confidence in the results?

A computer program is said to learn from experience
E with respect to some task T and some performance
measure P, if its performance on T, as measured by P,
improves with experience E.
• Suppose your email program watches which
emails you do or do not mark as spam, and based
on that learns how to better filter spam. What
is the task T in this setting?
A. Classifying emails as spam or not spam
B. Watching you label emails as spam or not spam
C. The number or fraction of emails correctly
classified as spam / not spam
D. None of the above. This is not a machine learning
problem

You are running a company, and you want to develop
learning algorithms to address each of two problems.
Problem 1: You have a large inventory of identical items.
You want to predict how many of these items will sell over
the next 3 months.
Problem 2: You’d like software to examine individual
customer accounts, and for each account decide if it has
been hacked / compromised.
Should you treat these as classification or regression
problems?
A. Treat both as classification problems
B. Treat problem 1 as classification problem, problem 2 as
regression problem
C. Treat problem 1 as regression problem, problem 2 as
classification problem
D. Treat both as regression problems

Of the following examples, which would you address
using an unsupervised learning algorithm?
(Select all that apply)
A. Given data of emails labelled as spam / not
spam, learn a spam filter
B. Given a set of news articles found on the web,
group them into sets of articles about the same
story
C. Given a dataset of customer data,
automatically discover market segments and
group customers into different market
segments
D. Given a dataset of patients, diagnosed as either
having diabetes or not, learn to classify new patients
