Hessian Matrix in Neural Networks

The document discusses the Hessian matrix and its role in machine learning. It defines the gradient and Hessian, and describes how backpropagation can be used to compute the Hessian. It then outlines several methods for evaluating the Hessian matrix, including diagonal approximation, outer product approximation, and exact evaluation using backpropagation.

Uploaded by

Suvij
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
196 views12 pages

Hessian Matrix in Neural Networks

The document discusses the Hessian matrix and its role in machine learning. It defines the gradient and Hessian, and describes how backpropagation can be used to compute the Hessian. It then outlines several methods for evaluating the Hessian matrix, including diagonal approximation, outer product approximation, and exact evaluation using backpropagation.

Uploaded by

Suvij
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12


The Hessian Matrix

Sargur Srihari


Definitions of Gradient and Hessian


•  The first derivative of a scalar function E(w) with respect to a vector w = [w₁, w₂]ᵀ is a vector called the Gradient of E(w):

$$\nabla E(\mathbf{w}) = \frac{dE}{d\mathbf{w}} = \begin{bmatrix} \dfrac{\partial E}{\partial w_1} \\[2ex] \dfrac{\partial E}{\partial w_2} \end{bmatrix}$$

   If there are M elements in the vector, then the Gradient is an M × 1 vector.
•  The second derivative of E(w) is a matrix called the Hessian of E(w):

$$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \frac{d^2E}{d\mathbf{w}^2} = \begin{bmatrix} \dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_1\,\partial w_2} \\[2ex] \dfrac{\partial^2 E}{\partial w_2\,\partial w_1} & \dfrac{\partial^2 E}{\partial w_2^2} \end{bmatrix}$$

   The Hessian is a matrix with M² elements.
•  The Jacobian is a matrix consisting of first derivatives with respect to a vector.
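A minimal numerical sketch of these definitions (the quadratic error E(w) = ½wᵀAw − bᵀw and the particular A, b below are illustrative choices, not from the slides): for this E the gradient is Aw − b and the Hessian is the constant matrix A, with shapes M and M × M respectively.

```python
import numpy as np

# Illustrative quadratic error E(w) = 0.5 * w^T A w - b^T w, with A symmetric
M = 2
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def E(w):
    return 0.5 * w @ A @ w - b @ w

def gradient(w):
    # dE/dw: an M-dimensional vector
    return A @ w - b

def hessian(w):
    # d^2E/dw^2: an M x M matrix (constant for a quadratic, equal to A)
    return A

w = np.array([0.5, 2.0])
g = gradient(w)      # shape (M,)
H = hessian(w)       # shape (M, M), i.e. M^2 elements

# Sanity check the gradient against central finite differences of E
eps = 1e-6
g_fd = np.array([(E(w + eps * np.eye(M)[i]) - E(w - eps * np.eye(M)[i])) / (2 * eps)
                 for i in range(M)])
assert np.allclose(g, g_fd, atol=1e-5)
print(g, "\n", H)
```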

Computing the Hessian using Backpropagation


•  We have shown how backpropagation can be used to obtain the first derivatives of the error function with respect to the weights in the network
•  Backpropagation can also be used to derive the second derivatives

$$\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}$$

•  If all weights and bias parameters are elements wᵢ of a single vector w, then the second derivatives form the elements Hᵢⱼ of the Hessian matrix H, where i, j ∈ {1, ..., W} and W is the total number of weights and biases
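A small sketch of the bookkeeping this implies (the layer sizes and parameter names W1, b1, W2, b2 are hypothetical): every weight and bias is packed into one vector w of length W, and the Hessian indexed by Hᵢⱼ is then a W × W array.

```python
import numpy as np

# Hypothetical two-layer network parameters (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(3, 4)),   # first-layer weights
    "b1": rng.normal(size=3),        # first-layer biases
    "W2": rng.normal(size=(1, 3)),   # second-layer weights
    "b2": rng.normal(size=1),        # second-layer biases
}

# Pack all weights and biases into a single parameter vector w of length W
w = np.concatenate([p.ravel() for p in params.values()])
W_total = w.size                      # total number of weights and biases

# The Hessian H_ij = d^2E/(dw_i dw_j), i, j in {1, ..., W}, is a W x W matrix
H = np.zeros((W_total, W_total))
print(W_total, H.shape)               # 19 (19, 19)
```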

Role of Hessian in Neural Computing


1.  Several nonlinear optimization algorithms for neural networks
    •  are based on second-order properties of the error surface
2.  Basis for a fast procedure for re-training a network after a small change in the training data
3.  Identifying the least significant weights
    •  Network pruning requires the inverse of the Hessian
4.  Bayesian neural networks
    •  Central role in the Laplace approximation
    •  The Hessian inverse is used to determine the predictive distribution for a trained network
    •  Hessian eigenvalues determine the values of hyperparameters
    •  The Hessian determinant is used to evaluate the model evidence

Evaluating the Hessian Matrix


•  The full Hessian matrix can be difficult to compute in practice
•  quasi-Newton algorithms have been developed that use approximations to the Hessian (see the sketch after this list)
•  Various approximation techniques have been used to evaluate the Hessian for a neural network
•  it can also be calculated exactly using an extension of backpropagation
•  An important consideration is efficiency
•  with W parameters (weights and biases) the matrix has dimension W × W
•  efficient methods have O(W²) complexity
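As an illustration of the quasi-Newton idea, a sketch using SciPy's BFGS on a toy objective (the objective E and its gradient here are stand-ins for a network error function and its backprop gradient, not anything from the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective and its gradient (stand-ins for a network error function)
def E(w):
    return 0.5 * np.sum(w**2) + np.sin(w[0])

def grad_E(w):
    g = w.copy()
    g[0] += np.cos(w[0])
    return g

# BFGS builds up an approximation to the inverse Hessian from successive
# gradient differences instead of forming the full W x W matrix explicitly.
res = minimize(E, x0=np.zeros(3), jac=grad_E, method="BFGS")
print(res.x, res.hess_inv)   # hess_inv is BFGS's inverse-Hessian approximation
```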


Methods for evaluating the Hessian Matrix


•  Diagonal Approximation
•  Outer Product Approximation
•  Inverse Hessian
•  Finite Differences
•  Exact Evaluation using Backpropagation
•  Fast multiplication by the Hessian


Diagonal Approximation

•  In many cases the inverse of the Hessian is needed
•  If the Hessian is approximated by a diagonal matrix (i.e., off-diagonal elements are zero), its inverse is trivially computed
•  Complexity is O(W) rather than O(W²) for the full Hessian
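A sketch of the idea (the diagonal values below are placeholders; in practice they would come from a backprop-style recursion or an approximation to it): keeping only diag(H) makes inversion an elementwise operation.

```python
import numpy as np

# Suppose diag_H holds estimates of the diagonal elements d^2E/dw_i^2
# (placeholder values here, standing in for a backprop-like computation).
diag_H = np.array([2.0, 0.5, 4.0, 1.0])

# Diagonal approximation to the Hessian: off-diagonal elements set to zero
H_diag = np.diag(diag_H)

# Its inverse is trivial: invert each diagonal entry, O(W) work
H_diag_inv = np.diag(1.0 / diag_H)

assert np.allclose(H_diag @ H_diag_inv, np.eye(len(diag_H)))
```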


Outer product approximation


•  Neural networks commonly use the sum-of-squares error function

$$E = \frac{1}{2}\sum_{n=1}^{N} (y_n - t_n)^2$$

•  The Hessian matrix can then be written in the form

$$\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{T}$$

   where $\mathbf{b}_n = \nabla y_n = \nabla a_n$
•  Elements can be found in O(W²) steps
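A sketch for the simplest case, a linear model yₙ = wᵀxₙ with sum-of-squares error, where bₙ = ∇yₙ = xₙ (the data X and sizes are illustrative; for a network, bₙ would instead be the gradient of the output-unit activation obtained by backprop):

```python
import numpy as np

rng = np.random.default_rng(0)
N, W = 100, 5
X = rng.normal(size=(N, W))          # inputs x_n, one per row
w = rng.normal(size=W)

# For y_n = w^T x_n with E = 0.5 * sum_n (y_n - t_n)^2, b_n = grad_w y_n = x_n,
# so the outer product approximation is H ≈ sum_n b_n b_n^T
H_outer = np.zeros((W, W))
for b_n in X:
    H_outer += np.outer(b_n, b_n)

# Equivalent vectorised form
assert np.allclose(H_outer, X.T @ X)
```

For this linear case the approximation happens to be exact, since the second derivatives of yₙ with respect to w vanish; for a nonlinear network the neglected term involves (yₙ − tₙ) times those second derivatives.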

Inverse Hessian

•  Use the outer product approximation to obtain a computationally efficient procedure for approximating the inverse of the Hessian
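A sketch of one way to do this sequentially (the data vectors bₙ and the initial term H₀ = αI are illustrative): because the outer product approximation adds one rank-one term bₙbₙᵀ per data point, the inverse can be updated with the Sherman–Morrison identity instead of being recomputed from scratch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, W = 50, 4
B = rng.normal(size=(N, W))       # rows are the vectors b_n

alpha = 0.01                      # small initial term H_0 = alpha * I
H = alpha * np.eye(W)
H_inv = np.eye(W) / alpha

for b in B:
    H = H + np.outer(b, b)        # rank-one update of H (for checking only)
    # Sherman-Morrison update of the inverse:
    #   (H + b b^T)^-1 = H^-1 - (H^-1 b)(b^T H^-1) / (1 + b^T H^-1 b)
    Hb = H_inv @ b
    H_inv = H_inv - np.outer(Hb, Hb) / (1.0 + b @ Hb)

assert np.allclose(H_inv, np.linalg.inv(H), atol=1e-6)
```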


Finite Differences

•  Using backprop, complexity is reduced from O(W³) to O(W²)
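A sketch of central differences applied to the gradient (the gradient function below is for a toy quadratic; in a network it would be the backprop gradient routine): each column of H costs two gradient evaluations, and each backprop gradient is O(W), giving O(W²) overall rather than the O(W³) of differencing the error function itself.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_E = lambda w: A @ w            # stand-in for a backprop gradient routine

def hessian_fd(grad, w, eps=1e-5):
    W = w.size
    H = np.zeros((W, W))
    for k in range(W):
        e_k = np.zeros(W); e_k[k] = eps
        # central difference of the gradient along coordinate k
        H[:, k] = (grad(w + e_k) - grad(w - e_k)) / (2 * eps)
    return 0.5 * (H + H.T)          # symmetrise to reduce rounding error

print(hessian_fd(grad_E, np.array([0.5, 2.0])))   # ≈ A
```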


Exact Evaluation of the Hessian

•  Using an extension of backprop


•  Complexity is O(W²)


Fast Multiplication by the Hessian

•  Applications of the Hessian involve multiplication by the Hessian
•  The vector vᵀH has only W elements
•  Instead of computing H as an intermediate step, find an efficient method to compute the product directly
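A sketch of one common way to get the product without ever forming H: a central-difference approximation of the gradient along the direction v (the toy quadratic and its gradient are illustrative; the exact O(W) procedure referred to above instead uses an extension of forward and backward propagation).

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_E = lambda w: A @ w             # stand-in for a backprop gradient routine

def hessian_vector_product(grad, w, v, eps=1e-5):
    # H v ≈ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps):
    # two gradient evaluations, O(W) storage, no W x W matrix ever built.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

w = np.array([0.5, 2.0])
v = np.array([1.0, -1.0])
Hv = hessian_vector_product(grad_E, w, v)
print(Hv, A @ v)                     # the two agree; since H is symmetric,
                                     # v^T H is just (H v)^T
```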
