Machine Learning Srihari
The Hessian Matrix
Sargur Srihari
Definitions of Gradient and Hessian
• The first derivative of a scalar function E(w) with respect to a vector w = [w1, w2]^T is a vector called the gradient of E(w)
  ∇E(w) = dE/dw = [∂E/∂w1, ∂E/∂w2]^T
  If there are M elements in the vector, then the gradient is an M x 1 vector
• The second derivative of E(w) is a matrix called the Hessian of E(w)
  H = ∇∇E(w) = d²E/dw² = [ ∂²E/∂w1²      ∂²E/∂w1∂w2 ]
                          [ ∂²E/∂w2∂w1   ∂²E/∂w2²   ]
  The Hessian is a matrix with M² elements
• The Jacobian is the matrix of first derivatives of a vector-valued function with respect to a vector
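A small concrete instance of the gradient and Hessian definitions above (the function E and the point w are hypothetical, chosen only so the code runs):

```python
import numpy as np

# Hypothetical example: E(w) = w1^2 * w2 + sin(w2), with w = [w1, w2]^T (M = 2)
w = np.array([1.0, 2.0])
w1, w2 = w

# Gradient: an M x 1 vector of first derivatives
grad_E = np.array([2 * w1 * w2,              # ∂E/∂w1
                   w1**2 + np.cos(w2)])      # ∂E/∂w2

# Hessian: an M x M symmetric matrix with M² elements
hess_E = np.array([[2 * w2,  2 * w1],        # ∂²E/∂w1²,    ∂²E/∂w1∂w2
                   [2 * w1, -np.sin(w2)]])   # ∂²E/∂w2∂w1,  ∂²E/∂w2²

print(grad_E.shape, hess_E.shape)            # (2,) (2, 2)
```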
Computing the Hessian using Backpropagation
• We have shown how backpropagation can be used to obtain the first derivatives of the error function with respect to the weights in a network
• Backpropagation can also be used to evaluate the second derivatives of the error
  ∂²E / (∂w_ji ∂w_lk)
• If all the weights and bias parameters are elements w_i of a single vector w, then the second derivatives form the elements H_ij of the Hessian matrix H, where i, j ∈ {1,…,W} and W is the total number of weights and biases
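A minimal sketch of collecting all weights and biases into a single vector w, assuming hypothetical layer sizes; only the flattening convention and the W x W shape of H matter here:

```python
import numpy as np

# Hypothetical two-layer network sizes; only the flattening convention matters.
D, n_hidden, K = 3, 4, 2
W1, b1 = np.zeros((n_hidden, D)), np.zeros(n_hidden)   # first-layer weights, biases
W2, b2 = np.zeros((K, n_hidden)), np.zeros(K)          # second-layer weights, biases

# Collect every weight and bias into one parameter vector w of length W
w = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
W = w.size                           # total number of weights and biases
H = np.zeros((W, W))                 # Hessian elements H_ij with i, j in {1, ..., W}

def unflatten(w):
    """Inverse mapping from the single vector w back to the parameter arrays."""
    i = 0
    W1 = w[i:i + n_hidden * D].reshape(n_hidden, D); i += n_hidden * D
    b1 = w[i:i + n_hidden];                          i += n_hidden
    W2 = w[i:i + K * n_hidden].reshape(K, n_hidden); i += K * n_hidden
    b2 = w[i:i + K]
    return W1, b1, W2, b2

print(W, H.shape)   # 26 (26, 26)
```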
Role of Hessian in Neural Computing
1. Several nonlinear optimization algorithms for neural networks
  • are based on second-order properties of the error surface
2. Basis for a fast procedure for retraining a network after a small change in the training data
3. Identifying the least significant weights
  • Network pruning requires the inverse of the Hessian
4. Bayesian neural networks
  • The Hessian plays a central role in the Laplace approximation
  • Its inverse is used to determine the predictive distribution for a trained network
  • Its eigenvalues determine the values of the hyperparameters
  • Its determinant is used to evaluate the model evidence
Evaluating the Hessian Matrix
• The full Hessian matrix can be difficult to compute in practice
  • quasi-Newton algorithms have been developed that use approximations to the Hessian
• Various approximation techniques have been used to evaluate the Hessian for a neural network
  • It can also be calculated exactly using an extension of backpropagation
• An important consideration is efficiency
  • With W parameters (weights and biases) the matrix has dimension W x W
  • Efficient methods have O(W²) complexity
Methods for evaluating the Hessian Matrix
• Diagonal Approximation
• Outer Product Approximation
• Inverse Hessian
• Finite Differences
• Exact Evaluation using Backpropagation
• Fast multiplication by the Hessian
Diagonal Approximation
• In many cases the inverse of the Hessian is needed
• If the Hessian is approximated by a diagonal matrix (i.e., the off-diagonal elements are zero), its inverse is trivially computed
• Complexity is O(W) rather than O(W²) for the full Hessian
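A minimal sketch of why the diagonal case is cheap; the diagonal entries below are made-up stand-ins for estimates of ∂²E/∂w_i²:

```python
import numpy as np

# Hypothetical diagonal entries h_ii; the diagonal approximation simply
# sets all off-diagonal elements of the Hessian to zero.
h_diag = np.array([2.5, 0.8, 1.7, 4.0])

# The inverse of a diagonal matrix is elementwise: cost O(W), no W x W inversion.
H_inv_diag = 1.0 / h_diag

# Applying the inverse Hessian to a vector is then just an elementwise product.
v = np.array([1.0, -2.0, 0.5, 3.0])
print(H_inv_diag * v)
```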
Outer Product Approximation
• Neural networks commonly use a sum-of-squares error function
  E = (1/2) Σ_{n=1}^{N} (y_n − t_n)²
• The Hessian matrix can then be written in the form
  H ≈ Σ_{n=1}^{N} b_n b_n^T
  where b_n = ∇y_n = ∇a_n
• The elements can be found in O(W²) steps
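A minimal sketch of accumulating the outer-product sum, assuming the vectors b_n are already available (random stand-ins below):

```python
import numpy as np

# Random stand-ins for the vectors b_n = ∇a_n (one row per data point);
# in practice each b_n would come from a backpropagation pass.
N, W = 100, 6
B = np.random.randn(N, W)

# H ≈ Σ_n b_n b_n^T, accumulated one outer product at a time
H = np.zeros((W, W))
for b_n in B:
    H += np.outer(b_n, b_n)

# The same sum written as a single matrix product
print(np.allclose(H, B.T @ B))   # True
```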
Inverse Hessian
• Use the outer-product approximation to obtain a computationally efficient procedure for approximating the inverse of the Hessian
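One standard way to exploit the outer-product form builds the inverse incrementally with the Sherman-Morrison identity; the sketch below assumes a starting value α·I (a small regularizer) and random stand-ins for the b_n:

```python
import numpy as np

# Random stand-ins for b_n, as on the previous slide; alpha*I is an assumed
# starting value that keeps the running matrix invertible.
N, W = 100, 6
B = np.random.randn(N, W)
alpha = 1e-3

# Fold in one outer product at a time with the Sherman-Morrison identity:
# (H + b b^T)^-1 = H^-1 - (H^-1 b)(H^-1 b)^T / (1 + b^T H^-1 b)
H_inv = np.eye(W) / alpha
for b in B:
    Hb = H_inv @ b
    H_inv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)

# Agrees with directly inverting alpha*I + Σ_n b_n b_n^T (up to rounding error)
H = alpha * np.eye(W) + B.T @ B
print(np.allclose(H_inv, np.linalg.inv(H)))   # True
```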
Finite Differences
• By applying central differences to the first derivatives obtained by backpropagation, the complexity is reduced from O(W³) to O(W²)
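A minimal sketch of this scheme: central differences are applied to the first derivatives, with grad() a hypothetical stand-in for the gradient that backpropagation would supply:

```python
import numpy as np

# grad() stands in for one O(W) backpropagation pass; this example has the
# known constant Hessian [[2, 1], [1, 6]].
def grad(w):
    return np.array([2.0 * w[0] + w[1], w[0] + 6.0 * w[1]])

def hessian_fd(grad, w, eps=1e-5):
    """Central differences applied to the first derivatives: 2W gradient
    evaluations of O(W) each, hence O(W²) overall."""
    W = w.size
    H = np.zeros((W, W))
    for k in range(W):
        d = np.zeros(W); d[k] = eps
        # column k: ∂²E/(∂w_j ∂w_k) ≈ [∂E/∂w_j(w + ε e_k) − ∂E/∂w_j(w − ε e_k)] / 2ε
        H[:, k] = (grad(w + d) - grad(w - d)) / (2.0 * eps)
    return H

print(hessian_fd(grad, np.array([0.3, -1.2])))   # ≈ [[2, 1], [1, 6]]
```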
Exact Evaluation of the Hessian
• Uses an extension of backpropagation
• Complexity is O(W²)
Fast Multiplication by the Hessian
• Applications of the Hessian involve multiplication by the Hessian
• The vector v^T H has only W elements
• Instead of computing H as an intermediate step, find an efficient method to compute the product directly
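A simple way to approximate the product without forming H is a central difference of the gradient along the direction v (the exact R-operator procedure achieves the same O(W) cost); grad() below is again a hypothetical stand-in for the backprop gradient:

```python
import numpy as np

# grad() stands in for the backprop gradient; its (constant) Hessian here is [[2, 1], [1, 6]].
def grad(w):
    return np.array([2.0 * w[0] + w[1], w[0] + 6.0 * w[1]])

def hessian_vector_product(grad, w, v, eps=1e-5):
    """Approximate H v (equivalently (v^T H)^T, since H is symmetric) without
    ever forming H: a central difference of the gradient along v, i.e. two
    O(W) gradient passes."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

w = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
print(hessian_vector_product(grad, w, v))   # ≈ H @ v = [4.0, 13.0]
```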