CMPE 442 Introduction to Machine Learning
• Polynomial Regression
• Locally Weighted Linear Regression
• Logistic Regression/ Softmax Regression
Polynomial Regression
What if your data is actually more complex than a simple straight line?
We can use a linear model to fit nonlinear data
A model is said to be linear if it is linear in parameters
y = θ₀ + θ₁x₁ + θ₂x₂
y = θ₀ + θ₁x + θ₂x²
y = θ₀ + θ₁x₁ + θ₂x₂ + θ₁₁x₁² + θ₂₂x₂² + θ₁₂x₁x₂
Add powers of each feature as new features, then train a linear model on
this extended set of features
Polynomial Regression
A kth order polynomial model in one variable is given by
y = θ₀ + θ₁x + θ₂x² + ... + θₖxᵏ
A second order polynomial model in two variables is given by
y = θ₀ + θ₁x₁ + θ₂x₂ + θ₁₁x₁² + θ₂₂x₂² + θ₁₂x₁x₂
An array containing n features can be transformed into an array containing (n + d)! / (d! n!) features, where d is the polynomial order
Polynomial Regression: Example
Let’s generate some nonlinear data based on a simple quadratic equation
x’s are assigned randomly
y = 2 + x + 0.5x²
Polynomial Regression: Example
A straight line poorly fits nonlinear data
A case of underfitting
h(x) = 3.52 + 0.98x
Polynomial Regression: Example
Add powers of each feature as new features
x^(i) → [x^(i), (x^(i))²]
y = 2 + x + 0.5x²
h(x) = 2.04 + 0.96x + 0.51x²    (see the sketch below)
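A minimal sketch of this example in Python with Scikit-Learn. The noise term, sample size, and x-range are assumptions, since the slide's data-generation code is not shown:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Generate noisy quadratic data: y = 2 + x + 0.5*x^2 + noise (assumed setup)
m = 100
X = 6 * np.random.rand(m, 1) - 3            # x sampled randomly in [-3, 3]
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

# Add x^2 as a new feature, then train an ordinary linear model on the extended features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)              # columns: [x, x^2]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)    # should come out close to 2, 1 and 0.5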
Polynomial Regression: Example
Polynomial Regression: Example
Evaluating the Performance of the Model
RMSE is the square root of the mean of the squared residuals:
RMSE(X, h) = √( (1/m) Σᵢ₌₁ᵐ (h(x^(i)) − y^(i))² )
R² score, or the coefficient of determination, explains how much the total variance of the dependent variable can be reduced by using the least-squares regression.
R² = 1 − SS_res / SS_tot,  where SS_tot = Σᵢ (y^(i) − ȳ)² and SS_res = Σᵢ (h(x^(i)) − y^(i))²    (see the sketch below)
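As a sketch, both metrics can be computed by hand or with Scikit-Learn; the arrays y and h_pred below are placeholders for the targets and the model predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y = np.array([1.2, 2.3, 3.1, 4.8])        # placeholder targets y^(i)
h_pred = np.array([1.0, 2.5, 3.0, 5.0])   # placeholder predictions h(x^(i))

# RMSE: square root of the mean of squared residuals
rmse = np.sqrt(np.mean((h_pred - y) ** 2))
rmse_sk = np.sqrt(mean_squared_error(y, h_pred))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((h_pred - y) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
r2_sk = r2_score(y, h_pred)

print(rmse, rmse_sk, r2, r2_sk)
```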
Locally Weighted Linear Regression
In the original linear regression algorithm, to make a prediction at a query
point x, we would:
1. Fit θ to minimize Σᵢ (y^(i) − θᵀx^(i))²
2. Output θᵀx
In the locally weighted linear regression algorithm:
1. Fit θ to minimize Σᵢ w^(i) (y^(i) − θᵀx^(i))²
2. Output θᵀx
Linear Regression
y = 2 + x + 0.5x²
Linear Regression
y = 2 + x + 0.5x²    h(x) = 2.04 + 0.96x + 0.51x²
Linear Regression
y = 2 + x + 0.5x²    h(x) = 2.04 + 0.96x + 0.51x²
x_new = 2.5    h(2.5) = 2.04 + 0.96·2.5 + 0.51·2.5² ≈ 7.63
Locally Weighted Linear Regression
y = 2 + x + 0.5x²
Locally Weighted Linear Regression
y = 2 + x + 0.5x²
x_new = 2.5
Locally Weighted Linear Regression
y = 2 + x + 0.5x²
x_new = 2.5
Locally Weighted Linear Regression
y = 2 + x + 0.5x²    (this line is learnt specifically for x_new)
x_new = 2.5
Locally Weighted Linear Regression
y = 2 + x + 0.5x²
x_new = 2.5
Gradient Descent
Update Rule:
θⱼ := θⱼ − η Σᵢ w^(i) (h_θ(x^(i)) − y^(i)) xⱼ^(i)
The magnitude of the update is proportional to the error and the weight of
a sample
Locally Weighted Linear Regression
w(i)’s are non-negative valued weights.
Weights depend on the particular query point x at which we are trying to make a prediction.
w^(i) = exp( −(x^(i) − x)² / (2τ²) ),  where τ is the bandwidth parameter
If |x^(i) − x| is small, then w^(i) is close to 1
If |x^(i) − x| is large, then w^(i) is small
θ is chosen giving a much higher weight to the (errors on) training examples
close to the query point x.
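A minimal NumPy sketch of locally weighted linear regression, using the closed-form weighted normal equation rather than the gradient-descent update above; the bandwidth τ and the toy data are assumptions, not the slide's exact setup:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Predict y at x_query by fitting a locally weighted linear model."""
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]                       # add intercept column x0 = 1
    xq = np.r_[1.0, np.atleast_1d(x_query)]         # query point with intercept

    # Gaussian weights: w^(i) = exp(-(x^(i) - x)^2 / (2 tau^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)

    # Weighted normal equation: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ theta

# Toy usage on the quadratic data from the slides
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 2 + X[:, 0] + 0.5 * X[:, 0] ** 2
print(lwlr_predict(2.5, X, y))                      # close to 2 + 2.5 + 0.5*2.5^2 = 7.625
```

The closed form is used here only for compactness; the same θ could be reached with the weighted gradient-descent update shown above.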
Locally Weighted Linear Regression
The linear regression algorithm is a parametric learning algorithm
It has a fixed, finite number of parameters which are fit to the data
Once we fit the θ's and store them away, we don't need to keep the training data for future predictions
Locally weighted linear regression is a non-parametric algorithm
To make predictions we need to keep the entire training set around
The amount of data we need to keep in order to represent the hypothesis h grows linearly with the size of the training set
Question
Is the gradient descent algorithm affected by the scales of features?
Question
Is the gradient descent algorithm affected by the scales of features?
Feature Normalization
We can speed up gradient descent by normalizing the features (bringing them into approximately equal ranges)
θ will descend quickly on small ranges and slowly on large ranges
1) Feature scaling: xⱼ := xⱼ / (max(xⱼ) − min(xⱼ))
2) Mean normalization: xⱼ := (xⱼ − μⱼ) / (max(xⱼ) − min(xⱼ))  or  xⱼ := (xⱼ − μⱼ) / σⱼ
For polynomial regression, feature normalization is even more important (see the sketch below).
Not needed for the Normal Equation.
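A brief sketch of both normalization schemes in NumPy, plus the standardization scaler from Scikit-Learn; the feature matrix X is a placeholder with deliberately mismatched ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2000.0, 3], [1600.0, 2], [2400.0, 4]])  # placeholder features with very different ranges

# 1) Feature scaling: divide each feature by its range (max - min)
X_scaled = X / (X.max(axis=0) - X.min(axis=0))

# 2) Mean normalization: subtract the mean, then divide by the range (or the std)
X_mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (zero mean, unit variance), commonly used before gradient descent
X_std = StandardScaler().fit_transform(X)
print(X_scaled, X_mean_norm, X_std, sep="\n")
```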
Summary on Linear Regression
Considered the case when the hypothesis function is linear.
We can make our hypothesis nonlinear by combining existing features into new features, making the function quadratic, cubic, etc.
Closed-form solution: Normal Equations
Scaling the data for gradient descent algorithm is important.
Logistic Regression
Classification problem
Similar to regression problem
Assume a binary classifier: y can take on only two values 0 and 1.
Ex.: spam email (1, positive class), ham email (0, negative class)
Linear Regression
h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ
Classification Problem
Threshold the classifier output at 0.5:
If h_θ(x) ≥ 0.5, predict y = 1; if h_θ(x) < 0.5, predict y = 0
Fitted line: h_θ(x) = −0.34 + 1.58x
Classification Problem
y ∈ {0, 1}
So, we want 0 ≤ h_θ(x) ≤ 1
Logistic Regression
Let's change the form of the hypothesis h_θ(x)
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
g(z) = 1 / (1 + e^(−z))    (the logistic function, or sigmoid function)
Logistic Regression
g(z) = 1 / (1 + e^(−z))
g(z) tends towards 1 as z → ∞
g(z) tends towards 0 as z → −∞
g(z) is always bounded between 0 and 1
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
θᵀx = θ₀x₀ + Σⱼ₌₁ⁿ θⱼxⱼ,  with x₀ = 1    (see the sketch below)
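A small sketch of the sigmoid and the resulting hypothesis in NumPy; the particular θ and x values are illustrative, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^-z), bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])      # illustrative parameters [theta0, theta1, theta2]
x = np.array([1.0, 2.0, 2.0])           # x0 = 1 (intercept term), x1 = 2, x2 = 2

print(sigmoid(np.array([-10, 0, 10])))  # -> approximately [0, 0.5, 1]
print(sigmoid(theta @ x))               # h_theta(x) = g(theta^T x)
```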
Logistic Regression
Predicts something that is True or False, instead of something continuous like size
Logistic Regression
Interpretation of Logistic Function
h_θ(x) can be interpreted as the estimated probability that y = 1 on input x
h_θ(x) = P(y = 1 | x; θ)
P(y = 0 | x; θ) = 1 − P(y = 1 | x; θ)
Decision Boundary
Assume: y = 1 if h_θ(x) ≥ 0.5
        y = 0 if h_θ(x) < 0.5
h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)
θ = [−3, 1, 1]ᵀ
Predict y = 1 if −3 + x₁ + x₂ ≥ 0
Decision boundary: x₁ + x₂ = 3
Logistic Regression
Given the logistic regression model, how do we fit θ for it?
Logistic Regression
Training set: {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))}
m examples
x = [x₀, x₁, …, xₙ]ᵀ,  x₀ = 1
y ∈ {0, 1}
h_θ(x) = 1 / (1 + e^(−θᵀx))
Cost Function
Linear Regression: J(θ) = (1/m) Σᵢ₌₁ᵐ ½ (h_θ(x^(i)) − y^(i))²
Cost(h_θ(x), y) = ½ (h_θ(x) − y)²
Using this cost function for logistic regression gives a non-convex function
Not guaranteed to converge to the global minimum
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0
Cost = 0 if y = 1 and h_θ(x) = 1
But as h_θ(x) → 0, Cost → ∞
If h_θ(x) = 0, we predict that P(y = 1 | x; θ) = 0; if actually y = 1, the learning algorithm is penalized by a very large cost
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1,  −log(1 − h_θ(x)) if y = 0
J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x^(i)), y^(i))
[Plots: Cost vs. h_θ(x) for y = 1 and for y = 0]
Cost Function
Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1,  −log(1 − h_θ(x)) if y = 0
y = 1 or y = 0: only two values are possible, so the two cases can be combined:
Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x^(i)), y^(i)) = −(1/m) Σᵢ₌₁ᵐ [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
This is a convex function
Logistic Regression
To fit parameters 𝜃:
min_θ J(θ) to obtain θ
To make a prediction on a given new x:
Output h_θ(x) = P(y = 1 | x; θ)
Gradient Descent
Want min_θ J(θ):
Repeat {
    θⱼ := θⱼ − η ∂J(θ)/∂θⱼ
} (simultaneously update all θⱼ)
Gradient Descent
J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(h_θ(x^(i)), y^(i))
     = −(1/m) Σᵢ₌₁ᵐ [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
∂J(θ)/∂θⱼ = (1/m) Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i)) xⱼ^(i)
Gradient Descent
Want min_θ J(θ):      h_θ(x) = 1 / (1 + e^(−θᵀx))
Repeat {
    θⱼ := θⱼ − η Σᵢ₌₁ᵐ (h_θ(x^(i)) − y^(i)) xⱼ^(i)
} (simultaneously update all θⱼ)    (a NumPy sketch follows below)
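A compact NumPy sketch of this gradient-descent loop; the toy data, the learning rate, the iteration count, and the 1/m factor in the update are assumptions chosen here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: label tends to be 1 when x1 + x2 > 3 (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 3).astype(float)
Xb = np.c_[np.ones(len(X)), X]                     # add the intercept feature x0 = 1

theta = np.zeros(Xb.shape[1])
eta, n_iters, m = 0.1, 1000, len(y)

for _ in range(n_iters):
    h = sigmoid(Xb @ theta)                        # h_theta(x^(i)) for every example
    gradient = Xb.T @ (h - y) / m                  # (1/m) * sum_i (h - y^(i)) x_j^(i)
    theta -= eta * gradient                        # simultaneous update of all theta_j

# Cross-entropy cost J(theta) after training
h = np.clip(sigmoid(Xb @ theta), 1e-12, 1 - 1e-12)
J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print("theta:", theta, "cost:", J)
```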
Python Implementation
The model gives probabilities (via predict_proba).
To get the classes, just use log_reg.predict() (see the sketch below)
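The scikit-learn code itself is not reproduced on the slide; a minimal sketch of the intended usage might look like this, with placeholder training arrays X and y:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # placeholder feature values
y = np.array([0, 0, 0, 1, 1, 1])                          # placeholder binary labels

log_reg = LogisticRegression()
log_reg.fit(X, y)

X_new = np.array([[1.7], [3.2]])
print(log_reg.predict_proba(X_new))   # estimated probabilities for classes 0 and 1
print(log_reg.predict(X_new))         # hard class predictions (threshold at 0.5)
```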
Example
Let's apply Logistic Regression to the Iris dataset (a sketch follows below).
Iris is a famous dataset that contains the sepal and petal length and width of 150 iris flowers
Three different species: Iris-Setosa, Iris-Versicolor and Iris-Virginica
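A sketch of one common way to set this up as a binary problem; the choice of petal width as the single feature and Iris-Virginica as the positive class is an assumption for illustration:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                  # petal width (cm) as the single feature
y = (iris["target"] == 2).astype(int)    # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

print(log_reg.predict_proba([[1.7], [2.5]]))  # P(not Virginica), P(Virginica) per flower
print(log_reg.predict([[1.7], [2.5]]))        # hard class predictions
```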
Iris Dataset
Example
Example
Nonlinear Decision Boundary
Nonlinear Decision Boundary
Nonlinear Decision Boundary
h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂)
h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
θ = [−1, 0, 0, 1, 1]ᵀ
Predict y = 1 if −1 + x₁² + x₂² ≥ 0,  i.e. x₁² + x₂² ≥ 1
Decision boundary: x₁² + x₂² = 1    (see the sketch below)
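As a sketch, the same effect can be obtained by adding the squared features before fitting a logistic regression; the circular toy data below is an assumption chosen to match the slide's boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data labeled by the circle x1^2 + x2^2 = 1 (the boundary from the slide)
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 >= 1).astype(int)

# Add the squared (and cross) terms as features, then fit a plain logistic regression
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([[0.0, 0.0], [1.5, 1.5]]))   # expected: inside the circle -> 0, outside -> 1
```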
Multiclass Classification
What if y ∈ {1, 2, …, k}, i.e. k classes?
Example: weather prediction {sunny, cloudy, snowy, …}
One-vs-all Classifier (one-vs-rest)
Class 1: h_θ^(1)(x)    Class 2: h_θ^(2)(x)    Class 3: h_θ^(3)(x)
[One scatter plot per class: class i is treated as the positive class and the remaining classes as negative]
One-vs-all Classifier (one-vs-rest)
h_θ^(i)(x) = P(y = i | x; θ) for i = 1, 2, 3
One-vs-all:
Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i
On a new input x, to make a prediction, pick the class i that maximizes h_θ^(i)(x)    (a sketch follows below)
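A brief sketch of one-vs-all built directly from K binary logistic regressions; the 3-class toy data comes from make_blobs and is an assumption (scikit-learn's OneVsRestClassifier automates the same idea):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# Toy 3-class problem (assumed for illustration)
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Train one binary classifier per class i: positive when y == i, negative otherwise
classifiers = []
for i in np.unique(y):
    clf = LogisticRegression()
    clf.fit(X, (y == i).astype(int))
    classifiers.append(clf)

# Predict by picking the class whose classifier gives the highest P(y = 1 | x)
def predict_ova(X_new):
    probs = np.column_stack([clf.predict_proba(X_new)[:, 1] for clf in classifiers])
    return probs.argmax(axis=1)

print(predict_ova(X[:5]), y[:5])   # predictions vs. true labels for the first few points
```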
Softmax Regression
Multi-class classification
Multinomial Logistic Regression
Generalization of logistic regression
𝑦 ∈ {1,2, … , 𝐾}
Softmax Regression
Given a test input x, estimate the probability P(y = k | x) for each value of k = 1, …, K
Estimate the probability of the class label taking on each of the 𝐾 different
possible values.
Our hypothesis will output a 𝐾-dimensional vector (whose elements sum to
1) giving 𝐾 estimated probabilities.
Softmax Regression: Hypothesis
The hypothesis h_θ(x) takes the form:
h_θ(x) = [ P(y = 1 | x; θ), P(y = 2 | x; θ), …, P(y = K | x; θ) ]ᵀ
       = (1 / Σⱼ₌₁ᴷ exp(θ^(j)ᵀ x)) · [ exp(θ^(1)ᵀ x), exp(θ^(2)ᵀ x), …, exp(θ^(K)ᵀ x) ]ᵀ
θ^(1), θ^(2), …, θ^(K) ∈ ℝⁿ are the parameters of the model
The term 1 / Σⱼ₌₁ᴷ exp(θ^(j)ᵀ x) normalizes the distribution so that it sums to one
θ = [ θ^(1) … θ^(K) ] is an n-by-K matrix whose columns are the θ^(k)    (see the sketch below)
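A small NumPy sketch of this hypothesis; the parameter matrix Theta and the input x below are illustrative values, not from the slides:

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Return the K-vector of probabilities P(y = k | x; Theta), k = 1..K.

    Theta is an n-by-K matrix whose columns are theta^(1), ..., theta^(K).
    """
    scores = Theta.T @ x                        # theta^(k)^T x for each class k
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()        # normalize so the entries sum to 1

Theta = np.array([[1.0, 0.5, -1.0],
                  [0.2, -0.3, 0.8]])            # n = 2 features, K = 3 classes (illustrative)
x = np.array([1.0, 2.0])
p = softmax_hypothesis(Theta, x)
print(p, p.sum())                               # K probabilities summing to 1
```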
Softmax Regression: Cost Function
1{·} is the indicator function:
1{true statement} = 1 and 1{false statement} = 0
Minimize the cost function:
J(θ) = − Σᵢ₌₁ᵐ Σₖ₌₁ᴷ 1{y^(i) = k} log( exp(θ^(k)ᵀ x^(i)) / Σⱼ₌₁ᴷ exp(θ^(j)ᵀ x^(i)) )
P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀ x^(i)) / Σⱼ₌₁ᴷ exp(θ^(j)ᵀ x^(i))
Softmax Regression
We cannot solve for the minimum of 𝐽(𝜃) analytically
Use an iterative optimization algorithm instead.
Taking derivatives, one can show that the gradient is:
∇_θ^(k) J(θ) = − Σᵢ₌₁ᵐ x^(i) ( 1{y^(i) = k} − P(y^(i) = k | x^(i); θ) )
where P(y^(i) = k | x^(i); θ) = exp(θ^(k)ᵀ x^(i)) / Σⱼ₌₁ᴷ exp(θ^(j)ᵀ x^(i))    (a sketch follows below)
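A compact sketch of this cost and gradient in NumPy, with a plain gradient-descent loop; the toy data, learning rate, and iteration count are assumptions:

```python
import numpy as np

def softmax_probs(X, Theta):
    """Row-wise softmax of the scores theta^(k)^T x^(i)."""
    scores = X @ Theta                                  # m x K score matrix
    scores -= scores.max(axis=1, keepdims=True)         # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def cost_and_grad(X, y, Theta):
    K = Theta.shape[1]
    P = softmax_probs(X, Theta)                         # P(y^(i) = k | x^(i); Theta)
    Y = np.eye(K)[y]                                    # one-hot rows encode 1{y^(i) = k}
    J = -np.sum(Y * np.log(P))                          # cost function from the slide
    grad = -X.T @ (Y - P)                               # columns are the gradients wrt theta^(k)
    return J, grad

# Toy 3-class data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 0] - X[:, 1] > 1).astype(int)  # labels 0, 1, 2

Theta = np.zeros((X.shape[1], 3))
for _ in range(1000):
    J, grad = cost_and_grad(X, y, Theta)
    Theta -= 0.005 * grad                               # plain gradient descent step
print("final cost:", J)
```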
Example: Iris dataset
Python Implementation
Scikit-Learn's LogisticRegression uses one-versus-all by default
Switch to Softmax Regression by setting multi_class="multinomial"    (see the sketch below)
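A sketch of that switch on the Iris data. Note that the behavior depends on the scikit-learn version: older releases defaulted to one-vs-rest and required multi_class="multinomial", while newer releases select the multinomial (softmax) formulation automatically for multiclass data; the feature choice and C value here are ours:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]     # petal length and petal width
y = iris["target"]              # three classes: 0, 1, 2

# On older scikit-learn versions, pass multi_class="multinomial" explicitly:
#   LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg = LogisticRegression(C=10, max_iter=1000)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5.0, 2.0]]))        # predicted class
print(softmax_reg.predict_proba([[5.0, 2.0]]))  # K probabilities summing to 1
```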
Example: MNIST dataset
70,000 images of digits handwritten by high school students and employees of the US Census Bureau
Each image is 28x28 pixels (hence has 784 features)
Each feature simply represents one pixel's intensity, from 0 (white) to 255 (black)
It is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images)    (a loading sketch follows below)
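A sketch of how this dataset is commonly loaded and split; fetch_openml with the "mnist_784" name is one common route (the download takes a while on first use and is cached afterwards):

```python
from sklearn.datasets import fetch_openml

# Download MNIST from OpenML (cached locally after the first call)
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist["data"], mnist["target"].astype(int)

print(X.shape, y.shape)          # (70000, 784): each image flattened to 784 pixel intensities

# The dataset is already ordered: first 60000 images for training, last 10000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```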
Example: MNIST dataset
Example: MNIST dataset
Learned Weights
Example: MNIST dataset
Illustration of the decrease of the cost function over time
Training Accuracy: 0.9209666666666667
Test Accuracy: 0.9206
Question
Suppose you train a logistic regression classifier and your hypothesis function H is
Which of the following figures represents the decision boundary given by the above classifier?
Question
Imagine you are given the graph below for logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning-rate values (different colors show the curves for the different learning rates).
Suppose you save the graph for future reference but forget to save the values of the different learning rates for this graph. Now you want to find out the relation between the learning-rate values of these curves. Which of the following is the true relation?
Note:
1. The learning rate for blue is l1
2. The learning rate for red is l2
3. The learning rate for green is l3
A) l1>l2>l3
B) l1 = l2 = l3
C) l1 < l2 < l3
D) None of these
Question
Suppose you are dealing with a 4-class classification problem and you want to train a logistic regression model on the data; for that, you are using the one-vs-all method.
How many logistic regression models do we need to fit for a 4-class classification problem?