Lecture 8: 2 February, 2023
Madhavan Mukund
https://www.cmi.ac.in/~madhavan
Data Mining and Machine Learning
January–April 2023
Linear regression
Training input is {(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)}
Each input x_i is a vector (x_{i1}, \ldots, x_{ik})
Add x_{i0} = 1 by convention
y_i is the actual output
How far away is our prediction h_\theta(x_i) from the true answer y_i?
Define a cost (loss) function

  J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)^2

Essentially, the sum of squared errors (SSE), justified via MLE
Divide by n to get the mean squared error (MSE)
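As a concrete illustration of this cost, here is a minimal numpy sketch (the names X, y, theta and the shapes are assumptions for this note, not from the lecture):

import numpy as np

def sse_cost(X, y, theta):
    # J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
    # X: (n, k+1) design matrix whose first column is all ones (x_i0 = 1)
    # y: (n,) vector of actual outputs; theta: (k+1,) parameter vector
    residuals = X @ theta - y          # h_theta(x_i) - y_i for every i
    return 0.5 * np.sum(residuals ** 2)

def mse_cost(X, y, theta):
    # Dividing the squared error by n gives the mean squared error
    return np.mean((X @ theta - y) ** 2)

# Tiny usage example with one feature plus the constant column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])           # fits this toy data exactly
print(sse_cost(X, y, theta), mse_cost(X, y, theta))   # 0.0 0.0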
The non-linear case
What if the relationship is not linear?
Here the best possible explanation seems to be a quadratic
Non-linear: cross dependencies
Input x_i : (x_{i1}, x_{i2})
Quadratic dependencies:

  y = \underbrace{\theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}}_{\text{linear}}
      + \underbrace{\theta_{11} x_{i1}^2 + \theta_{22} x_{i2}^2 + \theta_{12} x_{i1} x_{i2}}_{\text{quadratic}}
The non-linear case
Recall how we fit a line

  \begin{bmatrix} 1 & x_{i1} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}

For quadratic, add new coefficients and expand parameters

  \begin{bmatrix} 1 & x_{i1} & x_{i1}^2 \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}
The non-linear case
Input x_i : (x_{i1}, x_{i2})
For the general quadratic case, we are adding new derived "features"

  x_{i3} = x_{i1}^2
  x_{i4} = x_{i2}^2
  x_{i5} = x_{i1} x_{i2}
The non-linear case
Original input matrix
  \begin{bmatrix}
  1 & x_{11} & x_{12} \\
  1 & x_{21} & x_{22} \\
    & \cdots &        \\
  1 & x_{i1} & x_{i2} \\
    & \cdots &        \\
  1 & x_{n1} & x_{n2}
  \end{bmatrix}
The non-linear case
Expanded input matrix
  \begin{bmatrix}
  1 & x_{11} & x_{12} & x_{11}^2 & x_{12}^2 & x_{11} x_{12} \\
  1 & x_{21} & x_{22} & x_{21}^2 & x_{22}^2 & x_{21} x_{22} \\
    &        & \cdots &          &          &               \\
  1 & x_{i1} & x_{i2} & x_{i1}^2 & x_{i2}^2 & x_{i1} x_{i2} \\
    &        & \cdots &          &          &               \\
  1 & x_{n1} & x_{n2} & x_{n1}^2 & x_{n2}^2 & x_{n1} x_{n2}
  \end{bmatrix}
New columns are computed
and filled in from original
inputs
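A small numpy sketch of how the expanded matrix can be built from the original two-feature inputs (the function name expand_quadratic and the toy data are assumptions for illustration):

import numpy as np

def expand_quadratic(X):
    # X is an (n, 2) matrix of inputs (x_i1, x_i2); return the expanded
    # design matrix with columns [1, x_i1, x_i2, x_i1^2, x_i2^2, x_i1*x_i2]
    x1, x2 = X[:, 0], X[:, 1]
    ones = np.ones(len(X))
    return np.column_stack([ones, x1, x2, x1**2, x2**2, x1 * x2])

# The expanded matrix is then used exactly like an ordinary design matrix,
# e.g. theta = np.linalg.pinv(Phi) @ y recovers least-squares coefficients.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Phi = expand_quadratic(X)
print(Phi.shape)   # (3, 6): constant column plus five derived features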
Exponential parameter blow-up
Cubic derived features, for a three-feature input (x_{i1}, x_{i2}, x_{i3}):

Cubic: x_{i1}^3, x_{i2}^3, x_{i3}^3,
       x_{i1}^2 x_{i2}, x_{i1}^2 x_{i3},
       x_{i2}^2 x_{i1}, x_{i2}^2 x_{i3},
       x_{i3}^2 x_{i1}, x_{i3}^2 x_{i2},
       x_{i1} x_{i2} x_{i3}

Quadratic: x_{i1}^2, x_{i2}^2, x_{i3}^2,
           x_{i1} x_{i2}, x_{i1} x_{i3}, x_{i2} x_{i3}

Linear: x_{i1}, x_{i2}, x_{i3}
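To quantify the blow-up: the number of monomials of degree at most d in k variables is C(k+d, d), a standard combinatorial fact consistent with the 3 linear, 6 quadratic and 10 cubic terms listed above (plus the constant) for k = 3, d = 3. A quick sketch:

from math import comb

def num_features(k, d):
    # Number of monomials of degree at most d in k variables,
    # including the constant term: C(k + d, d)
    return comb(k + d, d)

for d in range(1, 6):
    print(d, num_features(3, d))   # k = 3 gives 4, 10, 20, 35, 56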
Higher degree polynomials
How complex a polynomial
should we try?
Aim for degree that
minimizes SSE
As degree increases,
features explode
exponentially
Overfitting
Need to be careful about
adding higher degree terms
For n training points, we can always fit a polynomial of degree (n-1) exactly
However, such a curve would not generalize well to new data points
Overfitting: the model fits the training data well, but performs poorly on unseen data
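A short illustration of this point (the toy data and degree choices are assumed for this note, not taken from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.linspace(0.0, 1.0, n)
y = 2.0 * x + 0.3 * rng.standard_normal(n)   # roughly linear data plus noise

for deg in (1, n - 1):
    coeffs = np.polyfit(x, y, deg=deg)
    sse = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {deg}: training SSE = {sse:.3e}")

# The degree n-1 polynomial reports a training SSE of essentially zero,
# but between and beyond the training points it follows the noise rather
# than the underlying linear trend, so it generalizes poorly.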
Regularization
Need to trade off SSE against curve complexity
So far, the only cost has been SSE
Add a cost related to the parameters (\theta_0, \theta_1, \ldots, \theta_k)
Minimize, for instance,

  \frac{1}{2} \sum_{i=1}^{n} (z_i - y_i)^2 + \lambda \sum_{j=1}^{k} \theta_j^2

where z_i is the predicted value h_\theta(x_i)
Second term penalizes curve complexity
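A minimal numpy sketch of this regularized objective and of a single gradient-descent step on it (the names ridge_cost, lam and lr are assumptions; following the sum over j >= 1, the bias \theta_0 is left unpenalized):

import numpy as np

def ridge_cost(X, y, theta, lam):
    # 0.5 * sum_i (z_i - y_i)^2 + lam * sum_{j>=1} theta_j^2
    z = X @ theta                            # predictions z_i
    sse = 0.5 * np.sum((z - y) ** 2)
    penalty = lam * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
    return sse + penalty

def ridge_gradient_step(X, y, theta, lam, lr=0.01):
    # One gradient-descent step; the penalty adds a 2*lam*theta_j shrinkage term
    z = X @ theta
    grad = X.T @ (z - y)
    grad[1:] += 2.0 * lam * theta[1:]
    return theta - lr * grad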
Regularization
Variations on regularization
Change the contribution of coefficients
to the loss function
Ridge regression: coefficients contribute \sum_{j=1}^{k} \theta_j^2

LASSO regression: coefficients contribute \sum_{j=1}^{k} |\theta_j|

Elastic net regression: coefficients contribute \sum_{j=1}^{k} \left( \lambda_1 |\theta_j| + \lambda_2 \theta_j^2 \right)
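For concreteness, these three variants correspond to the Ridge, Lasso and ElasticNet estimators in scikit-learn; a brief sketch, assuming scikit-learn is available and keeping in mind that its exact penalty scaling differs slightly from the formulas above:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + 0.1 * rng.standard_normal(100)

for model in (Ridge(alpha=1.0),                      # squared (L2) coefficient penalty
              Lasso(alpha=0.1),                      # absolute-value (L1) penalty
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # mixture of the two
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))

# LASSO and elastic net tend to drive some coefficients exactly to zero,
# while ridge only shrinks them towards zero.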
The non-polynomial case
Percentage of urban population as a function of per capita GDP
Not clear what polynomial would be reasonable
Take log of GDP
Regression we are computing is

  y = \theta_0 + \theta_1 \log x_1
The non-polynomial case
Reverse the relationship
Plot per capita GDP in
terms of percentage of
urbanization
Now we take log of the
output variable
  \log y = \theta_0 + \theta_1 x_1
Log-linear transformation
Earlier was linear-log
Can also use log-log
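A small sketch of both transformed regressions (the toy GDP and urbanization data are invented purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
gdp = rng.uniform(500.0, 50000.0, size=200)                    # per capita GDP (toy values)
urban = 10.0 + 8.0 * np.log(gdp) + rng.normal(0.0, 2.0, 200)   # % urban population (toy)

# Linear-log: y = theta_0 + theta_1 * log(x), fit by least squares on log(x)
A = np.column_stack([np.ones_like(gdp), np.log(gdp)])
theta, *_ = np.linalg.lstsq(A, urban, rcond=None)
print("linear-log coefficients:", theta)

# Log-linear (relationship reversed): log(y) = theta_0 + theta_1 * x
B = np.column_stack([np.ones_like(urban), urban])
phi, *_ = np.linalg.lstsq(B, np.log(gdp), rcond=None)
print("log-linear coefficients:", phi)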
Regression for classification
Regression line
Set a threshold
Classifier:
  Output below threshold : 0 (No)
  Output above threshold : 1 (Yes)
Classifier output is a step function
Smoothen the step
Sigmoid function

  \sigma(z) = \frac{1}{1 + e^{-z}}

Input z is the output of our regression

  \sigma(z) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \cdots + \theta_k x_k)}}

Adjust parameters to fix horizontal position and steepness of step
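A short sketch of the sigmoid and of how the regression parameters move and sharpen the step (the parameter values are arbitrary, chosen only for illustration):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), a smooth version of the 0/1 step
    return 1.0 / (1.0 + np.exp(-z))

# One-feature model: z = theta_0 + theta_1 * x.
# theta_1 controls the steepness of the step; the step is centred where
# z = 0, i.e. at x = -theta_0 / theta_1.
x = np.linspace(-5.0, 5.0, 11)
for theta0, theta1 in [(0.0, 1.0), (0.0, 5.0), (-2.0, 5.0)]:
    print(theta0, theta1, np.round(sigmoid(theta0 + theta1 * x), 2))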
Logistic regression
Compute the coefficients?
Solve by gradient descent
Need derivatives to exist
Hence smooth sigmoid, not step function
Check that

  \sigma'(z) = \sigma(z)(1 - \sigma(z))

Need a cost function to minimize
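The identity \sigma'(z) = \sigma(z)(1 - \sigma(z)) can be checked numerically; a quick sketch (the finite-difference step is an arbitrary choice):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(z) * (1.0 - sigmoid(z))                    # sigma(z)(1 - sigma(z))
print(np.max(np.abs(numeric - analytic)))                     # agrees to about 1e-10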
MSE for logistic regression and gradient descent
Suppose we take mean squared error as the loss function:

  C = \frac{1}{n} \sum_{i=1}^{n} (y_i - \sigma(z_i))^2, where z_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}

For gradient descent, we compute \frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \frac{\partial C}{\partial \theta_0}

Consider two inputs x = (x_1, x_2)
For j = 1, 2,

  \frac{\partial C}{\partial \theta_j}
    = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \sigma(z_i)) \cdot \frac{\partial \sigma(z_i)}{\partial \theta_j}
    = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta_j}
    = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i) \, x_{ij}

  \frac{\partial C}{\partial \theta_0}
    = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \frac{\partial \sigma(z_i)}{\partial z_i} \frac{\partial z_i}{\partial \theta_0}
    = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i)
MSE for logistic regression and gradient descent . . .
For j = 1, 2,

  \frac{\partial C}{\partial \theta_j} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i) \, x_{ij},
  and
  \frac{\partial C}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} (\sigma(z_i) - y_i) \, \sigma'(z_i)

Each term in \frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \frac{\partial C}{\partial \theta_0} is proportional to \sigma'(z_i)
Ideally, gradient descent should take large steps when \sigma(z) - y is large
\sigma(z) is flat at both extremes
If \sigma(z) is completely wrong, \sigma(z) \approx (1 - y), we still have \sigma'(z) \approx 0
Learning is slow even when the current model is far from optimal
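A tiny numerical illustration of this saturation effect (the values of z are arbitrary): even when the prediction is confidently wrong, the MSE gradient carries the factor \sigma'(z) and is therefore close to zero, while the cross-entropy gradient derived below is not.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                        # true class
for z in (-10.0, -2.0, 0.0, 2.0):
    s = sigmoid(z)
    mse_factor = (s - y) * s * (1.0 - s)       # per-example MSE gradient factor (times x_j)
    ce_factor = s - y                          # cross-entropy gradient factor (times x_j)
    print(f"z = {z:6.1f}  sigma = {s:.5f}  MSE factor = {mse_factor:+.6f}  CE factor = {ce_factor:+.4f}")

# At z = -10 the prediction is badly wrong (sigma ~ 0 while y = 1), yet the
# MSE gradient factor is of the order 1e-5, so gradient descent barely moves.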
Loss function for logistic regression
Goal is to maximize log likelihood
Let h_\theta(x_i) = \sigma(z_i). So, P(y_i = 1 \mid x_i; \theta) = h_\theta(x_i),
P(y_i = 0 \mid x_i; \theta) = 1 - h_\theta(x_i)
Combine as P(y_i \mid x_i; \theta) = h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}

Likelihood: L(\theta) = \prod_{i=1}^{n} h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}

Log-likelihood: \ell(\theta) = \sum_{i=1}^{n} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))

Minimize cross entropy: -\sum_{i=1}^{n} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))
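A minimal sketch of this cross-entropy loss (the names X, y, theta are assumptions; the small eps clip is a numerical safeguard not discussed on the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y, eps=1e-12):
    # Negative log-likelihood: -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    # X: (n, k+1) design matrix with a leading column of ones; y: labels in {0, 1}
    h = sigmoid(X @ theta)              # h_theta(x_i) = sigma(z_i)
    h = np.clip(h, eps, 1.0 - eps)      # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))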
Cross entropy and gradient descent
  C = -[\, y \ln(\sigma(z)) + (1 - y) \ln(1 - \sigma(z)) \,]

  \frac{\partial C}{\partial \theta_j}
    = \frac{\partial C}{\partial \sigma} \frac{\partial \sigma}{\partial \theta_j}
    = -\left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma}{\partial \theta_j}
    = -\left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma}{\partial z} \frac{\partial z}{\partial \theta_j}
    = -\left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z) \, x_j
    = -\frac{y (1 - \sigma(z)) - (1 - y)\sigma(z)}{\sigma(z)(1 - \sigma(z))} \, \sigma'(z) \, x_j
Cross entropy and gradient descent . . .
  \frac{\partial C}{\partial \theta_j} = -\frac{y (1 - \sigma(z)) - (1 - y)\sigma(z)}{\sigma(z)(1 - \sigma(z))} \, \sigma'(z) \, x_j

Recall that \sigma'(z) = \sigma(z)(1 - \sigma(z))

Therefore,

  \frac{\partial C}{\partial \theta_j} = -[\, y (1 - \sigma(z)) - (1 - y)\sigma(z) \,] \, x_j
    = -[\, y - y\sigma(z) - \sigma(z) + y\sigma(z) \,] \, x_j
    = (\sigma(z) - y) \, x_j

Similarly, \frac{\partial C}{\partial \theta_0} = (\sigma(z) - y)

Thus, as we wanted, the gradient is proportional to \sigma(z) - y
The greater the error, the faster the learning
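Putting the pieces together, a minimal end-to-end sketch of logistic regression trained by gradient descent using the gradient (\sigma(z) - y) x derived above (the toy data, learning rate and iteration count are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=2000):
    # Gradient descent on cross entropy; per-example gradient is (sigma(z_i) - y_i) x_i
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        errors = sigmoid(X @ theta) - y    # sigma(z_i) - y_i
        grad = X.T @ errors / n            # average over the training set
        theta -= lr * grad
    return theta

# Toy data: two features plus a leading column of ones for theta_0
rng = np.random.default_rng(0)
X_raw = rng.standard_normal((200, 2))
y = (X_raw[:, 0] + 2.0 * X_raw[:, 1] > 0).astype(float)
X = np.column_stack([np.ones(len(X_raw)), X_raw])

theta = train_logistic(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print("training accuracy:", np.mean(preds == y))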