Class Notes 02feb2023

- The document discusses linear and non-linear regression models for machine learning.
- For linear regression, the model predicts outputs based on linear combinations of input features.
- For non-linear regression, the model accounts for quadratic and higher-order polynomial relationships between inputs and outputs by adding derived features like squares and cross-terms to the input data.
- However, adding too many higher-order terms can lead to overfitting, so regularization is introduced to penalize complex models.

Lecture 8: 2 February, 2023

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


January–April 2023
Linear regression
Training input is $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
Each input $x_i$ is a vector $(x_{i1}, \ldots, x_{ik})$
Add $x_{i0} = 1$ by convention
$y_i$ is the actual output

How far away is our prediction $h_\theta(x_i)$ from the true answer $y_i$?
Define a cost (loss) function
$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2$$
Essentially, the sum squared error (SSE)
Justified via MLE
Divide by $n$ to get the mean squared error (MSE)
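The cost function above can be sketched in a few lines of Python. This is only an illustration: the helper names `predict`, `sse_cost` and `mse`, and the toy data, are assumptions made for the example and do not come from the lecture.

```python
def predict(theta, x):
    """Linear prediction h_theta(x); theta[0] pairs with the constant x_0 = 1."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def sse_cost(theta, xs, ys):
    """J(theta) = 1/2 * sum of squared differences between prediction and truth."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


def mse(theta, xs, ys):
    """Mean squared error: the squared differences averaged over the n points."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data lying exactly on y = 1 + 2x
xs = [(0.0,), (1.0,), (2.0,)]
ys = [1.0, 3.0, 5.0]

perfect = sse_cost([1.0, 2.0], xs, ys)   # the true line: no error
off = sse_cost([0.0, 2.0], xs, ys)       # intercept wrong by 1 at every point
```

With the wrong intercept each residual is -1, so the three squared errors sum to 3 and J is half of that.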
Madhavan Mukund Lecture 8: 2 February, 2023 DMML Jan–Apr 2023 2 / 22
The non-linear case

What if the relationship is not linear?
Here the best possible explanation seems to be a quadratic
Non-linear: cross dependencies
Input $x_i$: $(x_{i1}, x_{i2})$
Quadratic dependencies:
$$y = \underbrace{\theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}}_{\text{linear}} + \underbrace{\theta_{11} x_{i1}^2 + \theta_{22} x_{i2}^2 + \theta_{12} x_{i1} x_{i2}}_{\text{quadratic}}$$
Recall how we fit a line


⇥ ⇤ ✓0
1 xi 1
✓1

For quadratic, add new


coefficients and expand
parameters
2 3
⇥ ⇤ ✓0
1 xi1 xi21 4 ✓1 5
✓2

Input $(x_{i1}, x_{i2})$
For the general quadratic case, we are adding new derived "features"
$x_{i3} = x_{i1}^2$
$x_{i4} = x_{i2}^2$
$x_{i5} = x_{i1} x_{i2}$


Original input matrix
$$\begin{bmatrix}
1 & x_{11} & x_{12} \\
1 & x_{21} & x_{22} \\
  & \cdots & \\
1 & x_{i1} & x_{i2} \\
  & \cdots & \\
1 & x_{n1} & x_{n2}
\end{bmatrix}$$

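Expanded input matrix
$$\begin{bmatrix}
1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2 & x_{11} x_{12} \\
1 & x_{21} & x_{22} & x_{21}^2 & x_{22}^2 & x_{21} x_{22} \\
  &        &        & \cdots   &          & \\
1 & x_{i1} & x_{i2} & x_{i1}^2 & x_{i2}^2 & x_{i1} x_{i2} \\
  &        &        & \cdots   &          & \\
1 & x_{n1} & x_{n2} & x_{n1}^2 & x_{n2}^2 & x_{n1} x_{n2}
\end{bmatrix}$$
New columns are computed and filled in from original inputs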
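As a rough sketch of how the new columns could be filled in, the function below maps one two-feature row to its expanded form. The name `expand_quadratic` and the sample rows are hypothetical, chosen only for illustration; the column order follows the expanded matrix above.

```python
def expand_quadratic(row):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = row
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


# Made-up raw inputs, one tuple per training point
raw_inputs = [(2.0, 3.0), (1.0, -1.0)]

# Every row of the expanded matrix is computed from the original row alone
expanded = [expand_quadratic(r) for r in raw_inputs]
```

After this step the model is again linear in the expanded columns, so the same linear regression machinery applies unchanged.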



Exponential parameter blow-up

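Cubic derived features, for an input with three features $(x_{i1}, x_{i2}, x_{i3})$
Cubic: $x_{i1}^3, x_{i2}^3, x_{i3}^3, x_{i1}^2 x_{i2}, x_{i1}^2 x_{i3}, x_{i2}^2 x_{i1}, x_{i2}^2 x_{i3}, x_{i3}^2 x_{i1}, x_{i3}^2 x_{i2}, x_{i1} x_{i2} x_{i3}$
Quadratic: $x_{i1}^2, x_{i2}^2, x_{i3}^2, x_{i1} x_{i2}, x_{i1} x_{i3}, x_{i2} x_{i3}$
Linear: $x_{i1}, x_{i2}, x_{i3}$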
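To see how quickly the feature list grows, the sketch below enumerates the monomials of each degree for three input features. The helper name `monomials` is an assumption made for this example. The check against `math.comb` uses the standard counting fact that there are C(k + d, d) monomials of degree at most d in k variables, including the constant term.

```python
from itertools import combinations_with_replacement
from math import comb


def monomials(k, degree):
    """Index tuples for all monomials of exactly `degree` in k variables.

    For example (1, 1, 2) stands for x_1^2 * x_2.
    """
    return list(combinations_with_replacement(range(1, k + 1), degree))


k = 3
counts = {d: len(monomials(k, d)) for d in (1, 2, 3)}

# Add 1 for the constant column x_0 = 1
total_with_constant = 1 + sum(counts.values())
```

For three features this gives 3 linear, 6 quadratic and 10 cubic terms, matching the lists above, and the count rises quickly as either the number of features or the degree increases.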
Higher degree polynomials

How complex a polynomial should we try?
Aim for degree that minimizes SSE
As degree increases, features explode exponentially

Overfitting

Need to be careful about adding higher degree terms
For $n$ training points, can always fit polynomial of degree $(n - 1)$ exactly
However, such a curve would not generalize well to new data points
Overfitting: model fits training data well, performs poorly on unseen data
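The exact-fit claim can be illustrated with Lagrange interpolation, which is one standard way (not necessarily the one used in the lecture) to build the degree-(n - 1) polynomial through n points. The sketch assumes a single input variable and distinct x values; the function name and the noisy data are made up.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the degree-(n-1) polynomial passing through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor: equals 1 at x = xi and 0 at x = xj
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Made-up noisy points scattered around the line y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1]

# Training error of the interpolating polynomial
train_sse = sum((lagrange_predict(xs, ys, x) - y) ** 2 for x, y in zip(xs, ys))

# Prediction outside the training range, to compare by eye with the trend y = x
outside = lagrange_predict(xs, ys, 6.0)
```

The training SSE is zero by construction, which is exactly why SSE on the training data alone cannot be used to choose the degree.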

Regularization

Need to trade off SSE against curve complexity
So far, the only cost has been SSE
Add a cost related to parameters $(\theta_0, \theta_1, \ldots, \theta_k)$
Minimize, for instance
$$\frac{1}{2} \sum_{i=1}^{n} (z_i - y_i)^2 + \sum_{j=1}^{k} \theta_j^2$$
Second term penalizes curve complexity
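A minimal sketch of this regularized cost is shown below, assuming a linear model whose prediction plays the role of z_i. The helper names and the toy data are hypothetical. Following the formula, the penalty sum starts at j = 1, so theta_0 is left out of it.

```python
def predict(theta, x):
    """Linear prediction z = theta_0 + theta_1*x_1 + ... + theta_k*x_k."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def regularized_cost(theta, xs, ys):
    """Half the SSE plus the sum of squared coefficients theta_1..theta_k."""
    sse_term = 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))
    penalty = sum(t * t for t in theta[1:])   # theta[0] is not penalized
    return sse_term + penalty


# Made-up data lying exactly on y = 1 + 2x
xs = [(0.0,), (1.0,), (2.0,)]
ys = [1.0, 3.0, 5.0]

# The SSE term is 0 here, so the whole cost is the penalty 2^2 = 4
cost = regularized_cost([1.0, 2.0], xs, ys)
```

Even a perfect fit pays for its coefficients, so the minimizer is pulled towards smaller parameter values.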
Variations on regularization
Change the contribution of coefficients to the loss function

Ridge regression:
Coefficients contribute $\sum_{j=1}^{k} \theta_j^2$

LASSO regression:
Coefficients contribute $\sum_{j=1}^{k} |\theta_j|$

Elastic net regression:
Coefficients contribute $\sum_{j=1}^{k} \lambda_1 |\theta_j| + \lambda_2 \theta_j^2$
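The three penalty terms can be written side by side as below. This is an illustrative sketch: the function names and the sample parameter vector are assumptions, and `lam1`, `lam2` stand for the two elastic net weights.

```python
def ridge_penalty(theta):
    """Sum of squared coefficients, skipping the intercept theta[0]."""
    return sum(t * t for t in theta[1:])


def lasso_penalty(theta):
    """Sum of absolute coefficients, skipping the intercept theta[0]."""
    return sum(abs(t) for t in theta[1:])


def elastic_net_penalty(theta, lam1, lam2):
    """Weighted mix of the LASSO and ridge contributions."""
    return sum(lam1 * abs(t) + lam2 * t * t for t in theta[1:])


# Made-up parameters: intercept 5, coefficients 3 and -4
theta = [5.0, 3.0, -4.0]

ridge = ridge_penalty(theta)                    # 9 + 16
lasso = lasso_penalty(theta)                    # 3 + 4
elastic = elastic_net_penalty(theta, 0.5, 0.1)  # 0.5 * 7 + 0.1 * 25
```

Setting `lam1 = 0` or `lam2 = 0` in the elastic net recovers a scaled ridge or LASSO penalty respectively.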
The non-polynomial case

Percentage of urban population as a function of per capita GDP
Not clear what polynomial would be reasonable
Take log of GDP
Regression we are computing is
$y = \theta_0 + \theta_1 \log x_1$


Reverse the relationship
Plot per capita GDP in terms of percentage of urbanization
Now we take log of the output variable
$\log y = \theta_0 + \theta_1 x_1$
Log-linear transformation
Earlier was linear-log
Can also use log-log
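Both transformations reduce to ordinary line fitting on transformed columns. The sketch below uses the closed-form least-squares line for a single input; the function `fit_line` and the synthetic data are assumptions for illustration and are not the GDP data from the lecture.

```python
import math


def fit_line(us, vs):
    """Least-squares fit v = a + b*u for one input variable (closed form)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    b = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
         / sum((u - mean_u) ** 2 for u in us))
    a = mean_v - b * mean_u
    return a, b


xs = [1.0, 2.0, 4.0, 8.0, 16.0]

# Linear-log: synthetic outputs from y = 2 + 3*log(x); regress y on log x
ys_linlog = [2.0 + 3.0 * math.log(x) for x in xs]
a_linlog, b_linlog = fit_line([math.log(x) for x in xs], ys_linlog)

# Log-linear: synthetic outputs from log y = 0.5 + 0.1*x; regress log y on x
ys_loglin = [math.exp(0.5 + 0.1 * x) for x in xs]
a_loglin, b_loglin = fit_line(xs, [math.log(y) for y in ys_loglin])
```

A log-log fit would apply `math.log` to both columns before calling `fit_line`.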

Regression for classification
Regression line
Set a threshold
Classifier
Output below threshold: 0 (No)
Output above threshold: 1 (Yes)

Classifier output is a step function

Smoothen the step

Sigmoid function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Input $z$ is output of our regression
$$\sigma(z) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \cdots + \theta_k x_k)}}$$
Adjust parameters to fix horizontal position and steepness of step
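The step function and its smooth replacement can be compared with the sketch below. The function names are assumptions for this example, and the two-branch form of `sigmoid` is an implementation choice (it avoids overflow in `math.exp` for very negative z) rather than something stated in the lecture.

```python
import math


def step(z, threshold=0.0):
    """Hard classifier: 1 above the threshold, 0 otherwise."""
    return 1 if z > threshold else 0


def sigmoid(z):
    """Smooth step 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_output(theta, x):
    """Sigmoid applied to the regression output theta_0 + theta_1*x_1 + ..."""
    z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))
    return sigmoid(z)


# With one input, the output crosses 0.5 where theta_0 + theta_1*x = 0,
# i.e. at x = -theta_0 / theta_1; a larger |theta_1| makes the step steeper.
midpoint = logistic_output([-2.0, 1.0], (2.0,))
gentle = logistic_output([-2.0, 1.0], (3.0,))
steep = logistic_output([-20.0, 10.0], (3.0,))
```

Both parameter settings place the step at x = 2, but the second one rises much more sharply just past that point.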

Logistic regression

Compute the coefficients?
Solve by gradient descent
Need derivatives to exist
Hence smooth sigmoid, not step function
Check that
$\sigma'(z) = \sigma(z)(1 - \sigma(z))$

Need a cost function to minimize
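The identity for the derivative can be checked numerically before relying on it. The sketch below compares the closed form with a central finite difference; the helper names, the step size `h` and the sample points are arbitrary choices for this illustration.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_prime(z):
    """Closed form sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, h=1e-6):
    """Central difference approximation to f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


sample_points = [-4.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(
    abs(sigmoid_prime(z) - numeric_derivative(sigmoid, z)) for z in sample_points
)
```

The largest gap is at the level of floating-point noise, which is consistent with the identity. Note also that the derivative peaks at z = 0 with value 0.25 and shrinks towards 0 at both extremes.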

MSE for logistic regression and gradient descent
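Suppose we take mean squared error as the loss function.
$$C = \frac{1}{n} \sum_{i=1}^{n} (y_i - \sigma(z_i))^2, \quad \text{where } z_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}$$
For gradient descent, we compute $\frac{\partial C}{\partial \theta_1}$, $\frac{\partial C}{\partial \theta_2}$, $\frac{\partial C}{\partial \theta_0}$
Consider two inputs $x = (x_1, x_2)$
For $j = 1, 2$,
$$\frac{\partial C}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} (y_i - \sigma(z_i)) \cdot \left(-\frac{\partial \sigma(z_i)}{\partial \theta_j}\right) = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i) \, x_{ij}$$
$$\frac{\partial C}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i)$$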
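For $j = 1, 2$, $\frac{\partial C}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i) \, x_{ij}$, and $\frac{\partial C}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i)$

Each term in $\frac{\partial C}{\partial \theta_1}$, $\frac{\partial C}{\partial \theta_2}$, $\frac{\partial C}{\partial \theta_0}$ is proportional to $\sigma'(z_i)$
Ideally, gradient descent should take large steps when $\sigma(z) - y$ is large
$\sigma(z)$ is flat at both extremes
If $\sigma(z)$ is completely wrong, $\sigma(z) \approx (1 - y)$, we still have $\sigma'(z) \approx 0$
Learning is slow even when current model is far from optimal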
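The slowdown can be seen by evaluating the per-example factor of the MSE gradient, (sigma(z) - y) * sigma'(z), at a mildly wrong and a confidently wrong prediction. The function names and the chosen z values are assumptions for this sketch.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mse_grad_factor(z, y):
    """Per-example factor (sigma(z) - y) * sigma'(z) in the MSE gradient."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)


# True label is 1 in both cases
mildly_wrong = mse_grad_factor(-1.0, 1.0)        # sigma(z) is about 0.27
confidently_wrong = mse_grad_factor(-10.0, 1.0)  # sigma(z) is nearly 0
```

The confidently wrong prediction has the larger error but the far smaller gradient factor, because sigma'(z) is almost zero on the flat part of the curve.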

Loss function for logistic regression

Goal is to maximize log likelihood
Let $h_\theta(x_i) = \sigma(z_i)$. So, $P(y_i = 1 \mid x_i; \theta) = h_\theta(x_i)$,
$P(y_i = 0 \mid x_i; \theta) = 1 - h_\theta(x_i)$

Combine as $P(y_i \mid x_i; \theta) = h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}$

Likelihood: $L(\theta) = \prod_{i=1}^{n} h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}$

Log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))$

Minimize cross entropy: $-\sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]$
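The relationship between the likelihood and the cross entropy can be sketched directly from the formulas. The function names and the sample probabilities are made up; the code assumes every predicted probability lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def likelihood(hs, ys):
    """Product over examples of h^y * (1 - h)^(1 - y)."""
    prod = 1.0
    for h, y in zip(hs, ys):
        prod *= (h ** y) * ((1.0 - h) ** (1 - y))
    return prod


def cross_entropy(hs, ys):
    """Negative log-likelihood: -sum(y*log(h) + (1 - y)*log(1 - h))."""
    return -sum(
        y * math.log(h) + (1 - y) * math.log(1.0 - h) for h, y in zip(hs, ys)
    )


# Made-up predicted probabilities h_theta(x_i) and true labels y_i
hs = [0.9, 0.2, 0.7]
ys = [1, 0, 1]

lik = likelihood(hs, ys)       # 0.9 * 0.8 * 0.7
ce = cross_entropy(hs, ys)     # equals -log(lik)
```

Since the logarithm is increasing, maximizing the likelihood and minimizing the cross entropy select the same parameters.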
Cross entropy and gradient descent

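$$C = -[y \ln(\sigma(z)) + (1 - y) \ln(1 - \sigma(z))]$$

$$\frac{\partial C}{\partial \theta_j} = \frac{\partial C}{\partial \sigma} \frac{\partial \sigma}{\partial \theta_j} = -\left[ \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right] \frac{\partial \sigma}{\partial \theta_j}$$

$$= -\left[ \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right] \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial \theta_j}$$

$$= -\left[ \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right] \sigma'(z) \, x_j$$

$$= -\left[ \frac{y (1 - \sigma(z)) - (1 - y) \sigma(z)}{\sigma(z)(1 - \sigma(z))} \right] \sigma'(z) \, x_j$$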

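$$\frac{\partial C}{\partial \theta_j} = -\left[ \frac{y (1 - \sigma(z)) - (1 - y) \sigma(z)}{\sigma(z)(1 - \sigma(z))} \right] \sigma'(z) \, x_j$$

Recall that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

Therefore,
$$\frac{\partial C}{\partial \theta_j} = -[y (1 - \sigma(z)) - (1 - y) \sigma(z)] \, x_j$$
$$= -[y - y \sigma(z) - \sigma(z) + y \sigma(z)] \, x_j$$
$$= (\sigma(z) - y) \, x_j$$
Similarly, $\frac{\partial C}{\partial \theta_0} = (\sigma(z) - y)$
Thus, as we wanted, the gradient is proportional to $\sigma(z) - y$
The greater the error, the larger the gradient step and the faster the learning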
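Putting the final gradient to work, the following is a minimal batch gradient descent sketch for logistic regression with the cross-entropy loss. The function name `train_logistic`, the step size, the number of epochs and the toy data are all assumptions chosen for illustration, and the gradient is averaged over the n examples, which only rescales the step.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs, ys, learning_rate=0.3, epochs=5000):
    """Batch gradient descent using dC/dtheta_j = (sigma(z) - y) * x_j."""
    n = len(xs)
    k = len(xs[0])
    theta = [0.0] * (k + 1)              # theta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, y in zip(xs, ys):
            z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))
            err = sigmoid(z) - y         # the factor the gradient is proportional to
            grad[0] += err               # dC/dtheta_0 = sigma(z) - y
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj      # dC/dtheta_j = (sigma(z) - y) * x_j
        theta = [t - learning_rate * g / n for t, g in zip(theta, grad)]
    return theta


# Made-up one-dimensional data: label flips from 0 to 1 between x = 2 and x = 3
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [0, 0, 0, 1, 1, 1]

theta = train_logistic(xs, ys)
predictions = [1 if sigmoid(theta[0] + theta[1] * x[0]) > 0.5 else 0 for x in xs]
```

No sigma'(z) factor appears in the update, so a confidently wrong prediction produces a large correction instead of a vanishing one.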