Unit 4 - Machine Learning PDF
Unit 4
https://medium.com/mlearning-ai/different-definitions-of-machine-learning-b46cedff9716
Machine Learning is a subfield of AI and is multidisciplinary.
Different Types of Machine Learning
https://www.potentiaco.com/what-is-machine-learning-definition-types-applications-and-examples/
1. Supervised Learning (contd.)
• Working Mechanism
• The algorithm finds relationships between the given input features and the
labelled outputs, essentially learning a statistical association (not necessarily
a cause-and-effect relationship) between the variables in the dataset.
• At the end of training, the algorithm has a model of how the inputs relate
to the outputs.
• This model is then deployed on new, unseen data, for which it predicts
outputs using the relationships learned from the training dataset.
• A deployed model does not improve on its own. It can keep improving
only if it is retrained or updated on new labelled data, which lets it pick up
new patterns and relationships.
Types of Supervised Learning
Regression
• Learning for prediction of a value
• Has continuous output
• Needs the corresponding output with each input
Classification
• Learning for classification of objects
• Has discrete output
• Needs labelled data with each input
1. Supervised Learning (contd.)
• Applications
• Price Prediction
• Image classification and segmentation
• Disease identification and medical diagnosis
• Fraud detection
• Spam detection
• Speech recognition
• Sentiment Analysis
• Image Captioning
• Image Generation
• Text Generation etc.
1. Supervised Learning (contd.)
• Algorithms
• Linear regression
• Logistic regression
• Naive Bayes
• Linear discriminant analysis
• Decision trees
• K-nearest neighbor algorithm
• Neural networks (Multilayer perceptron)
• Random forest algorithm
• Support Vector Machine etc.
https://www.simplilearn.com/10-algorithms-machine-learning-engineers-need-to-know-article
2. Unsupervised Algorithm
• A class of algorithms that work with input data but no corresponding
output or label.
• The goal of unsupervised learning is to model, understand, or
manipulate the underlying structure or distribution of the data in
order to learn more about the data.
• There is no supervision from an expert through labels or outputs; hence
the name unsupervised algorithm.
• Unsupervised machine learning purports to uncover previously
unknown patterns in data, but most of the time these patterns are
poor approximations of what supervised machine learning can
achieve.
Types of Unsupervised Algorithms
• Clustering allows you to automatically split the dataset into groups according to
similarity. Often, however, cluster analysis overestimates the similarity between groups
and doesn’t treat data points as individuals. For this reason, cluster analysis is a poor
choice for applications like customer segmentation and targeting.
• Anomaly detection can automatically discover unusual data points in your dataset. This is
useful in pinpointing fraudulent transactions, discovering faulty pieces of hardware, or
identifying an outlier caused by a human error during data entry.
• Association mining identifies sets of items that frequently occur together in your dataset.
Retailers often use it for basket analysis, because it allows analysts to discover goods
often purchased at the same time and develop more effective marketing and
merchandising strategies.
• Latent variable models are commonly used for data preprocessing, such as reducing the
number of features in a dataset (dimensionality reduction) or decomposing the dataset
into multiple components.
https://www.datarobot.com/wiki/unsupervised-machine-learning/
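As a hedged illustration of clustering, the sketch below runs a few iterations of k-means (one of the algorithms listed in this unit) on one-dimensional points. The function name `kmeans_1d`, the sample points, and the starting centroids are all assumptions made up for this example; it is not a production implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means sketch: assign points, then recompute centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up data with two obvious groups, and hypothetical starting centroids.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(centroids, clusters)
```

Note that no labels are passed in: the grouping comes only from the distances between points.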
2. Unsupervised Algorithms (contd.)
• Applications
• Market Basket Analysis
• Semantic Clustering
• Delivery Store Optimization
• Identifying Accident Prone Areas
• Customer Segmentation
• Dimensionality Reduction
• Image Segmentation
• Audience segmentation
• Market research
• Recommendation System
2. Unsupervised Algorithms (contd.)
• Algorithms
• K-means clustering
• Hierarchical clustering
• Gaussian Mixture Models
• Apriori algorithms
• FP Growth
• Principal Component Analysis
• Singular Value Decomposition
• Autoencoders
• Local Outlier Factor
• Expectation-Maximization
Linear Regression
• Linear regression attempts to
model the relationship between
two variables by fitting a linear
equation to observed data.
• One variable is considered to be an
explanatory variable, and the other
is considered to be a dependent
variable.
• For example, a modeler might want
to relate the weights of individuals
to their heights using a linear
regression model.
Linear Regression (contd.)
• Linear regression is a simple yet powerful and widely used algorithm
in data science.
• It models a linear relationship between the independent (predictor)
variable, plotted on the X-axis, and the dependent (output) variable,
plotted on the Y-axis.
• If there is a single input variable X (independent variable), the model
is called simple linear regression.
• The above graph presents the linear relationship between the
output (y) variable and the predictor (X) variable.
• The red line is referred to as the best-fit straight line.
Linear Regression (contd.)
• The goal of the linear regression algorithm is to find the values of the
intercept $\beta_0$ and the slope $\beta_1$ that give the best-fit line
$y = \beta_0 + \beta_1 x$.
• In simple terms, the best-fit line is the line that fits the given scatter plot
in the best way.
• The best-fit line is the line with the least error, meaning that the
error between the predicted values and the actual values is
minimal.
Linear Regression (contd.)
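Random Error (Residuals)
• In regression, the difference between the observed value of the
dependent variable ($y_i$) and the predicted value ($\hat{y}_i$) is called the residual.
$$\varepsilon_i = y_i - \hat{y}_i$$
• The mean squared error (MSE) averages the squared residuals over the $m$ data points:
$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$
• Different optimization techniques can be used to minimize this error, or
equivalently the residual sum of squares (RSS). Two of them are the ordinary
least squares method and gradient descent.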
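A minimal sketch of the residual and MSE definitions, assuming a simple linear model with one input. The helper names (`predict`, `residuals`, `mse`) and the three data points are made up for illustration.

```python
def predict(x, beta0, beta1):
    """Prediction of the simple linear model: beta0 + beta1 * x."""
    return beta0 + beta1 * x


def residuals(xs, ys, beta0, beta1):
    """Residual of each point: observed value minus predicted value."""
    return [y - predict(x, beta0, beta1) for x, y in zip(xs, ys)]


def mse(xs, ys, beta0, beta1):
    """Mean of the squared residuals over all m data points."""
    res = residuals(xs, ys, beta0, beta1)
    return sum(r * r for r in res) / len(res)


# Made-up data and a hypothetical candidate line y = 0 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
res = residuals(xs, ys, beta0=0.0, beta1=2.0)
error = mse(xs, ys, beta0=0.0, beta1=2.0)
print(res, error)
```

Only the third point is off the candidate line, so it alone contributes to the error.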
Ordinary Least Squares Method
• This is a statistical method for performing linear regression.
• The ordinary least squares error is given by:
$$OLSE = \frac{1}{2} \sum_{i=1}^{m} \left( h_\beta(x^{(i)}) - y^{(i)} \right)^2$$
• Using the least squares method, we can write
$$\beta_1 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m} (x_i - \bar{x})^2}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
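The closed-form expressions for the slope and intercept can be sketched directly in code. The function name `fit_ols` is an assumption, and the height/weight numbers are invented so that they lie exactly on the line weight = 0.8 * height - 70; they are not real measurements.

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares fit for one input variable."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    beta1 = numerator / denominator
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Invented data lying exactly on weight = 0.8 * height - 70.
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 58.0, 66.0, 74.0]
beta0, beta1 = fit_ols(heights, weights)
print(beta0, beta1)
```

Because the data are noise-free, the fit should recover the slope and intercept up to floating-point rounding.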
Gradient Descent
• We use linear regression to predict the continuous dependent
variable Y from the independent variables X. It assumes that the relationship
between the independent and dependent variables is linear:
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
• Gradient descent is an optimization algorithm that iteratively minimizes
the cost function (objective function) to reach the optimal
solution.
• To find the optimum solution, we need to reduce the cost
function (MSE) over all data points.
• This is done by updating the values of $\beta_0$ and $\beta_1$ iteratively until we
get an optimal solution:
$$J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\beta(x^{(i)}) - y^{(i)} \right)^2$$
Gradient Descent (contd.)
• We use the update rule:
$$\beta_j = \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$$
which, for a single training example $(x, y)$, becomes
$$\beta_j = \beta_j + \alpha \left( y - h_\beta(x) \right) x_j$$
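The gradient descent update for simple linear regression can be sketched as below, using the batch gradient of the cost J (averaged over all m points) rather than the single-example form. The function names, the learning rate `alpha=0.1`, the step count, and the data (which lie exactly on y = 1 + 2x) are assumptions chosen for illustration.

```python
def cost(xs, ys, beta0, beta1):
    """Cost J: sum of squared errors divided by 2m."""
    m = len(xs)
    return sum((beta0 + beta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, steps=2000):
    """Batch gradient descent for the line beta0 + beta1 * x."""
    beta0, beta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Prediction error h(x) - y for every data point.
        errors = [beta0 + beta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to beta0 and beta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Move both parameters against the gradient.
        beta0 -= alpha * grad0
        beta1 -= alpha * grad1
    return beta0, beta1


# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = gradient_descent(xs, ys)
final_cost = cost(xs, ys, beta0, beta1)
print(beta0, beta1, final_cost)
```

A learning rate that is too large makes the updates diverge, and one that is too small needs many more steps, so `alpha` usually has to be tuned per problem.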
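Forward Propagation
[Figure: a feedforward network with inputs x1, x2, x3, a layer l = 1 with two units, and a layer l = 2 with one unit producing the output ŷ. Here $w_{ij}^{[l]}$ is the weight from input i to unit j of layer l; bias terms are not drawn.]
Layer 1 (l = 1):
$$z_1^{[1]} = w_{11}^{[1]} x_1 + w_{21}^{[1]} x_2 + w_{31}^{[1]} x_3, \qquad a_1^{[1]} = g^{[1]}(z_1^{[1]})$$
$$z_2^{[1]} = w_{12}^{[1]} x_1 + w_{22}^{[1]} x_2 + w_{32}^{[1]} x_3, \qquad a_2^{[1]} = g^{[1]}(z_2^{[1]})$$
Layer 2 (l = 2):
$$z_1^{[2]} = w_{11}^{[2]} a_1^{[1]} + w_{21}^{[2]} a_2^{[1]}, \qquad \hat{y} = a_1^{[2]} = g^{[2]}(z_1^{[2]})$$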
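The unit-by-unit forward pass of this 3-input, 2-unit, 1-output network can be sketched in plain Python. The sketch assumes the sigmoid function for both activations g and omits biases, as in the diagram; the function names and all weight values are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for both layers."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass; w1[i][j] is the weight from input i to layer-1 unit j."""
    n_inputs, n_units = len(w1), len(w1[0])
    # Layer 1: weighted sum of the inputs for each unit, then activation.
    z1 = [sum(w1[i][j] * x[i] for i in range(n_inputs)) for j in range(n_units)]
    a1 = [sigmoid(z) for z in z1]
    # Layer 2: a single output unit fed by the layer-1 activations.
    z2 = sum(w2[j] * a1[j] for j in range(n_units))
    y_hat = sigmoid(z2)
    return z1, a1, z2, y_hat


# Hypothetical input and weights.
x = [1.0, 2.0, 3.0]
w1 = [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]
w2 = [0.5, -0.5]
z1, a1, z2, y_hat = forward(x, w1, w2)
print(z1, a1, z2, y_hat)
```

The nested loops here are exactly what vectorization replaces with a matrix product.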
Forward Propagation
Vectorization Technique I
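Layer 1 Computation
Each row of X is one training example (here 4 examples with 3 features each), and $w_{ij}^{[1]}$ is the weight from input i to unit j.
$$A^{[0]} = X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \\ x_{41} & x_{42} & x_{43} \end{bmatrix}, \quad W^{[1]} = \begin{bmatrix} w_{11}^{[1]} & w_{12}^{[1]} \\ w_{21}^{[1]} & w_{22}^{[1]} \\ w_{31}^{[1]} & w_{32}^{[1]} \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} b_1^{[1]} & b_2^{[1]} \end{bmatrix}$$
$$Z^{[1]} = X W^{[1]} + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$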
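Layer 2 Computation
$$A^{[1]} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \end{bmatrix}, \quad W^{[2]} = \begin{bmatrix} w_{11}^{[2]} \\ w_{21}^{[2]} \end{bmatrix}, \quad b^{[2]} = \begin{bmatrix} b_1^{[2]} \end{bmatrix}$$
$$Z^{[2]} = A^{[1]} W^{[2]} + b^{[2]}$$
$$A^{[2]} = g^{[2]}(Z^{[2]})$$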
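$$Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$$
$$A^{[l]} = g^{[l]}(Z^{[l]})$$
# A[0] is the input X; W, b, Z and A are indexed by layer l = 1 … L
for l in range(1, L + 1):
    Z[l] = numpy.matmul(A[l - 1], W[l]) + b[l]
    A[l] = sigmoid(Z[l])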
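A self-contained sketch of Vectorization Technique I, where every row of the input matrix is one example. It assumes sigmoid for every layer's activation, as in the loop above; the function name `forward_rows`, the random inputs and weights, and the zero biases are illustrative assumptions.

```python
import numpy as np


def sigmoid(Z):
    """Element-wise logistic activation, assumed for every layer."""
    return 1.0 / (1.0 + np.exp(-Z))


def forward_rows(X, weights, biases):
    """Forward pass with examples stored as rows: Z = A W + b."""
    A = X  # A[0] = X, shape (examples, features)
    for W, b in zip(weights, biases):
        # b has shape (1, units) and is broadcast across all example rows.
        Z = np.matmul(A, W) + b
        A = sigmoid(Z)
    return A


# 4 examples with 3 features, a 2-unit layer, then a 1-unit output layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 1))]
biases = [np.zeros((1, 2)), np.zeros((1, 1))]
Y_hat = forward_rows(X, weights, biases)
print(Y_hat.shape)
```

One matrix product per layer handles all four examples at once, so no loop over examples or units is needed.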
Forward Propagation
Vectorization Technique II
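Layer 1 Computation
Each column of X is one training example (here 4 examples with 3 features each), and $w_{ij}^{[1]}$ is the weight from input i to unit j.
$$A^{[0]} = X = \begin{bmatrix} x_{11} & x_{21} & x_{31} & x_{41} \\ x_{12} & x_{22} & x_{32} & x_{42} \\ x_{13} & x_{23} & x_{33} & x_{43} \end{bmatrix}, \quad W^{[1]} = \begin{bmatrix} w_{11}^{[1]} & w_{12}^{[1]} \\ w_{21}^{[1]} & w_{22}^{[1]} \\ w_{31}^{[1]} & w_{32}^{[1]} \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \end{bmatrix}$$
$$Z^{[1]} = \left( W^{[1]} \right)^{T} X + b^{[1]}$$
$$A^{[1]} = g^{[1]}(Z^{[1]})$$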
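Layer 2 Computation
$$A^{[1]} = \begin{bmatrix} a_{11} & a_{21} & a_{31} & a_{41} \\ a_{12} & a_{22} & a_{32} & a_{42} \end{bmatrix}, \quad W^{[2]} = \begin{bmatrix} w_{11}^{[2]} \\ w_{21}^{[2]} \end{bmatrix}, \quad b^{[2]} = \begin{bmatrix} b_1^{[2]} \end{bmatrix}$$
$$Z^{[2]} = \left( W^{[2]} \right)^{T} A^{[1]} + b^{[2]}$$
$$A^{[2]} = g^{[2]}(Z^{[2]})$$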
For Coding
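$$Z^{[l]} = \left( W^{[l]} \right)^{T} A^{[l-1]} + b^{[l]}$$
$$A^{[l]} = g^{[l]}(Z^{[l]})$$
# A[0] is the input X; W, b, Z and A are indexed by layer l = 1 … L
for l in range(1, L + 1):
    Z[l] = numpy.matmul(W[l].T, A[l - 1]) + b[l]
    A[l] = sigmoid(Z[l])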
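A hedged sketch of Vectorization Technique II, where every column of the input matrix is one example, together with a check that it gives the transpose of the row-wise (Technique I) result. Sigmoid activations, the function names, and the random values are assumptions made for this example.

```python
import numpy as np


def sigmoid(Z):
    """Element-wise logistic activation, assumed for every layer."""
    return 1.0 / (1.0 + np.exp(-Z))


def forward_columns(X, weights, biases):
    """Examples stored as columns: Z = W^T A + b, with b of shape (units, 1)."""
    A = X  # shape (features, examples)
    for W, b in zip(weights, biases):
        Z = np.matmul(W.T, A) + b  # b is broadcast across example columns
        A = sigmoid(Z)
    return A


def forward_rows(X, weights, biases):
    """Examples stored as rows: Z = A W + b, with b of shape (1, units)."""
    A = X  # shape (examples, features)
    for W, b in zip(weights, biases):
        A = sigmoid(np.matmul(A, W) + b)
    return A


rng = np.random.default_rng(1)
X_rows = rng.normal(size=(4, 3))                    # 4 examples, 3 features
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 1))]
biases_cols = [rng.normal(size=(2, 1)), rng.normal(size=(1, 1))]

Y_cols = forward_columns(X_rows.T, weights, biases_cols)
Y_rows = forward_rows(X_rows, weights, [b.T for b in biases_cols])
print(Y_cols.shape, np.allclose(Y_cols, Y_rows.T))
```

Both techniques use the same weight matrices; only the orientation of the data and bias arrays differs, so the choice is mainly a matter of convention.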
Backward Propagation
• Backpropagation is short for "backward propagation of errors".
• It is a widely used method for calculating derivatives inside deep
feedforward neural networks.
• Backpropagation forms an important part of a number of supervised
learning algorithms for training feedforward neural networks, such as
stochastic gradient descent.
• When training a neural network by gradient descent, a loss function is
calculated, which represents how far the network's predictions are
from the true labels.
Backward Propagation
• Backpropagation allows us to calculate the gradient of the loss
function with respect to each of the weights of the network.
• This enables every weight to be updated individually to gradually
reduce the loss function over many training iterations.
• Backpropagation involves the calculation of the gradient proceeding
backwards through the feedforward network from the last layer
through to the first.
• To calculate the gradient at a particular layer, the gradients of all
following layers are combined via the chain rule of calculus.
https://deepai.org/machine-learning-glossary-and-terms/backpropagation
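Forward Propagation
[Figure: a single neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, and output ŷ]
$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$
or
$$z = w^T x + b$$
or
$$z = x^T w + b$$
$$a = \mathrm{sigmoid}(z) = \hat{y}$$
1. Compute Loss
2. Perform Backward Propagation
3. Update weights and biases.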
Backward Propagation
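1. Compute Loss
$$Loss = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$
This is called the cross-entropy loss, commonly used in classification. We may use other
loss functions as well, e.g. MSE.
2. Perform Backward Propagation
a. Compute $\frac{\partial Loss}{\partial \hat{y}}$
b. Compute $\frac{\partial Loss}{\partial a} = \frac{\partial Loss}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a}$
c. Compute $\frac{\partial Loss}{\partial z} = \frac{\partial Loss}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial z}$
d. Compute $\frac{\partial Loss}{\partial w} = \frac{\partial Loss}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w}$
3. Update Weights and Biases
$$w = w - \alpha \frac{\partial Loss}{\partial w}$$
$$b = b - \alpha \frac{\partial Loss}{\partial b}$$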
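The three steps (compute the loss, backpropagate with the chain rule, update the parameters) can be sketched for a single sigmoid neuron as follows. The function names, the input, the zero initial weights, and the learning rate `alpha=0.1` are assumptions for illustration. The bias gradient is obtained the same way as the weight gradient, using ∂z/∂b = 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """z = w^T x + b, then y_hat = a = sigmoid(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, sigmoid(z)


def loss(y, y_hat):
    """Cross-entropy loss for one example with label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def backward(x, y, y_hat):
    """Chain rule from the loss back to the weights and the bias."""
    dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
    dyhat_da = 1.0                        # the output y_hat is the activation a
    da_dz = y_hat * (1 - y_hat)           # derivative of the sigmoid
    dL_dz = dL_dyhat * dyhat_da * da_dz
    dL_dw = [dL_dz * xi for xi in x]      # dz/dw_i = x_i
    dL_db = dL_dz                         # dz/db = 1
    return dL_dw, dL_db


def train_step(x, y, w, b, alpha):
    """One forward pass, one backward pass, one parameter update."""
    _, y_hat = forward(x, w, b)
    dL_dw, dL_db = backward(x, y, y_hat)
    new_w = [wi - alpha * g for wi, g in zip(w, dL_dw)]
    new_b = b - alpha * dL_db
    return new_w, new_b


# Hypothetical example: one input with true label 1, weights starting at zero.
x, y = [1.0, 2.0, 3.0], 1.0
w, b = [0.0, 0.0, 0.0], 0.0
loss_before = loss(y, forward(x, w, b)[1])
w, b = train_step(x, y, w, b, alpha=0.1)
loss_after = loss(y, forward(x, w, b)[1])
print(loss_before, loss_after, w, b)
```

For this loss and activation the product of the first three factors simplifies to `y_hat - y`, but the sketch keeps the factors separate to mirror steps a to d.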