How calculating the Hessian works for neural network learning

Can anyone explain to me, in an easy and less mathematical way, what a Hessian is and how it works in practice when optimizing the learning process for a neural network?

Esch answered 25/4, 2014 at 15:22 Comment(0)

To understand the Hessian you first need to understand the Jacobian, and to understand the Jacobian you need to understand the derivative.

  • The derivative is the measure of how fast the function value changes with a change of the argument. So if you have the function f(x)=x^2 you can compute its derivative and learn how fast f(x+t) changes for small enough t. This gives you knowledge about the basic dynamics of the function.
  • The gradient shows you, for multidimensional functions, the direction of the biggest value change (it is based on the directional derivatives). So given a function, e.g. g(x,y)=-x+y^2, you know that it is better to minimize the value of x while strongly maximizing the value of y. This is the basis of gradient-based methods, like the steepest descent technique (used in traditional backpropagation methods).
  • The Jacobian is yet another generalization, for functions with many output values, like g(x,y)=(x+1, xy, x-y). You now have 2*3 partial derivatives: one gradient per output value (each with 2 entries), together forming a matrix of 2*3=6 values. (A small numerical sketch of all three objects follows this list.)
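To make these three objects concrete, here is a minimal sketch using NumPy and central finite differences on the example functions above. The helper names (`derivative`, `gradient`, `jacobian`) and the step size `h` are my own illustrative choices, not anything prescribed by the answer.

```python
import numpy as np

def derivative(f, x, h=1e-5):
    """Approximate f'(x) for a scalar function of one variable."""
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar function of a vector x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jacobian(f, x, h=1e-5):
    """Approximate the Jacobian of a vector-valued function of a vector x."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x), dtype=float)
    J = np.zeros((f0.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        J[:, i] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

print(derivative(lambda x: x**2, 3.0))                # ~6, since d/dx x^2 = 2x
print(gradient(lambda v: -v[0] + v[1]**2, [1., 2.]))  # ~[-1, 4]
print(jacobian(lambda v: np.array([v[0] + 1, v[0]*v[1], v[0] - v[1]]),
               [1., 2.]))                             # 3x2 matrix of partials
```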

Now, the derivative shows you the dynamics of the function itself. But you can go one step further: if you can use these dynamics to find the optimum of the function, maybe you can do even better by finding the dynamics of these dynamics, and so compute derivatives of second order. This is exactly what the Hessian is: the matrix of second-order derivatives of your function. It captures the dynamics of the derivatives, so how fast (and in what direction) the change itself changes. It may seem a bit complex at first sight, but if you think about it for a while it becomes quite clear. You want to go in the direction of the gradient, but you do not know "how far" (what the correct step size is). So you define a new, smaller optimization problem, where you ask "ok, I have this gradient, how can I tell where to go?" and solve it analogously, using derivatives (and the derivatives of the derivatives form the Hessian).
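As a hedged illustration of that idea (my own construction, not from the answer): the sketch below approximates the Hessian as the finite-difference Jacobian of the gradient and takes a single Newton step delta = -H^{-1} g. On a quadratic bowl this one step lands on the minimum, which is exactly the "how far to go" information the gradient alone does not give you. Function, step sizes and variable names are illustrative assumptions.

```python
import numpy as np

def gradient(f, x, h=1e-5):
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Hessian as the Jacobian of the gradient: H[i, j] = d^2 f / dx_i dx_j."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(x); e[j] = h
        H[:, j] = (gradient(f, x + e, h) - gradient(f, x - e, h)) / (2 * h)
    return H

# Quadratic bowl f(x, y) = 3x^2 + y^2 + xy: a single Newton step from any
# starting point lands (up to numerical error) on the minimum at the origin.
f = lambda v: 3 * v[0]**2 + v[1]**2 + v[0] * v[1]
x = np.array([2.0, -1.0])
step = -np.linalg.solve(hessian(f, x), gradient(f, x))
print(x + step)   # ~[0, 0]
```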

You may also look at this in a geometrical way: gradient-based optimization approximates your function with a line. You simply try to find the line which is closest to your function at the current point, and it defines the direction of change. Now, lines are quite primitive; maybe we could use some more complex shapes, like... parabolas? Second-derivative, Hessian-based methods simply try to fit a parabola (a quadratic function, f(x)=ax^2+bx+c) to your current position, and based on this approximation, choose a valid step.
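A small one-dimensional sketch of that geometric picture (again my own, with an arbitrary convex test function as an assumption): fitting a parabola that matches the value, slope and curvature at the current point and jumping to its minimum is exactly the Newton update x - f'(x)/f''(x).

```python
import numpy as np

def newton_step_1d(f, x, h=1e-4):
    d1 = (f(x + h) - f(x - h)) / (2 * h)           # slope f'(x)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2   # curvature f''(x)
    return x - d1 / d2                             # minimum of the fitted parabola

f = lambda x: x**2 + np.exp(x)   # strictly convex, minimum near x = -0.35
x = 2.0
for _ in range(5):
    x = newton_step_1d(f, x)
print(x)   # ~-0.3517, the point where f'(x) = 2x + e^x = 0
```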

Yaker answered 28/4, 2014 at 8:22 Comment(9)
I knew about finding a stationary point of a function using Newton's method (the Hessian). But I still do not understand how to compute the Hessian for neural networks, since there are different layers and different activation functions along the way, and then apply it for the weight update. Also, you did not explain anything about the Jacobian in your answer. Did you want to say something and forget to do so? – Standford
The Jacobian is just a generalization of the gradient; it is the matrix of all partial derivatives with respect to each output variable and weight in the network. – Yaker
I'm just learning about different methods of training neural networks other than backpropagation, so I am still not clear on how one should compute the Hessian for neural networks and perform the weight update. Can you explain more, or can you point me to a tutorial/paper describing the idea? – Standford
In short: backpropagation is not a learning technique, it is just an efficient way of computing the gradient, nothing more, and actually all NN learning techniques are gradient based (the Hessian is just "one step deeper", it is a gradient of a gradient; see the sketch after this comment thread). I can suggest "Neural Networks and Learning Machines" by S. Haykin, or, if you are not at all familiar with optimization, "Numerical Analysis" by D. Kincaid. – Yaker
I don't get why you say one must know about the Jacobian first, and then never talk about it again. – Arabeila
@Yaker Would you have any references to that fun fact on momentum? – Socman
@Yaker I join the request by Mr Tsjolder - is there any paper that shows this? – Harlene
@Yaker +1 for the "fun fact", any references? – Strickland
I removed the fun fact from the answer as I do not remember which paper I was referring to 8 years ago :( – Yaker
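To make the point from the comments concrete ("the Hessian is a gradient of a gradient"), here is a hedged sketch, not taken from the thread: treat the whole network as one scalar loss of a flat weight vector, and the Hessian is just the Jacobian of that loss's gradient. The tiny 2-2-1 tanh network, the single training example, the finite-difference approach and the damping constant are all illustrative assumptions; real implementations use exact second-order backpropagation or Hessian-vector products instead.

```python
import numpy as np

x_in = np.array([0.5, -1.0])   # one training input (assumed for illustration)
y_t  = 0.3                     # its target

def loss(w):
    """Squared error of a 2-2-1 tanh network, weights given as a flat vector."""
    W1 = w[:4].reshape(2, 2)   # input -> hidden weights
    w2 = w[4:6]                # hidden -> output weights
    hid = np.tanh(W1 @ x_in)
    y = w2 @ hid
    return 0.5 * (y - y_t) ** 2

def gradient(f, w, h=1e-5):
    w = np.asarray(w, dtype=float)
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def hessian(f, w, h=1e-4):
    """Finite-difference Jacobian of the gradient: one column per weight."""
    w = np.asarray(w, dtype=float)
    H = np.zeros((w.size, w.size))
    for j in range(w.size):
        e = np.zeros_like(w); e[j] = h
        H[:, j] = (gradient(f, w + e, h) - gradient(f, w - e, h)) / (2 * h)
    return H

w = np.random.default_rng(0).normal(size=6) * 0.5
g, H = gradient(loss, w), hessian(loss, w)
print(H.shape)   # (6, 6): one second derivative per pair of weights
# A damped Newton-style update (the Hessian of a nonconvex net may be
# indefinite, hence the 0.1*I term); plain gradient descent would use just -g.
w_new = w - np.linalg.solve(H + 0.1 * np.eye(6), g)
```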
