I am doing an optimisation problem using Scipy, where I take a flat network of vertices and bonds of size NN x NN, connect two of its sides (i.e., make it periodic), and minimise an energy function, so that it curls up to form a cylinder. (See the links below.)
Since I have the function energy(xyz-position) and its gradient, I decided to use the three methods recommended in the Scipy manual -- Newton-CG, BFGS, and L-BFGS-B -- and compare how they performed.
I call the optimisation function as follows, simply replacing 'Newton-CG' with 'BFGS' or 'L-BFGS-B' as appropriate:
from scipy.optimize import minimize
res = minimize(energy, xyzInit, method='Newton-CG', jac=energy_der, options={'disp': True})
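For completeness, the whole comparison boils down to a small driver along these lines (a sketch; energy, energy_der and xyzInit are defined elsewhere in my code, and the timing is plain time.time() bookkeeping):

import time
from scipy.optimize import minimize

# Run the same minimisation with each of the three methods and time it.
for method in ['Newton-CG', 'BFGS', 'L-BFGS-B']:
    t0 = time.time()
    res = minimize(energy, xyzInit, method=method, jac=energy_der, options={'disp': True})
    print('Method:', method)
    print('Final energy:', res.fun)
    print('Time taken for minimisation:', time.time() - t0)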
I found the following general behaviour (I am giving the output data for the case of NN=9, corresponding to a 3*9^2 = 243-dimensional parameter space):
BFGS systematically failed to find the correct minimum (for low NN), and failed to converge at all for large NN. See https://plot.ly/~apal90/162/ for the end result.

NN=9
Method: BFGS
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 204.465912
Iterations: 1239
Function evaluations: 1520
Gradient evaluations: 1508
Time taken for minimisation: 340.728140116
Newton-CG found the correct minimum for small NN (<=8), but starting from NN=9 it returned an incorrect minimum (viz., a cylinder squashed at one end), and for higher values it stopped converging at all. Note: this behaviour was for some reason aggravated for odd NN's. See https://plot.ly/~apal90/164/.

NN=9
Method: Newton-CG
Optimization terminated successfully.
Current function value: 7.954412
Iterations: 49
Function evaluations: 58
Gradient evaluations: 1654
Hessian evaluations: 0
Time taken for minimisation: 294.203114033
L-BFGS-B found the correct minimum, and did so blazingly fast, for all NN's that I tested (up to NN=14). See https://plot.ly/~apal90/160/.

NN=9
Method: L-BFGS-B
Time taken for minimisation: 36.3749790192
Question: Why is L-BFGS-B superior in this case to the other two methods? In particular, why is it so much superior to BFGS, when both are supposed to be quasi-Newton methods that work (to my understanding) in exactly the same manner?
My thoughts on the situation: All three methods make quadratic approximations at every point x. For this they need a gradient and a Hessian. If the Hessian is not given, it must be built up by the algorithm. In our case, where only the gradient is explicitly given, this is estimated numerically at every step. More specifically, what we require is the inverse of the Hessian, and this is a very expensive step, especially in higher dimensions. Now, Newton-CG calculates this inverse Hessian explicitly, hence its longer time requirements. The quasi-Newton methods like BFGS and L-BFGS calculate an approximation to the Hessian (i.e., the curvature) based on the gradient, which is cheaper in time, and which is also supposedly a better estimate of the curvature about a point. Thus, for quadratic functions Newton-CG converges faster, whereas for non-quadratic functions the quasi-Newton methods converge better. L-BFGS is a lower-memory version of BFGS that stores far less at every step than the full NxN matrix, hence it is faster than BFGS.
This explanation shows a divergence between Newton-CG and the quasi-Newton methods. What it does not explain is the inability of the algorithms to find the true minimum, and especially the disparity between BFGS and L-BFGS, which are both supposed to function in the same manner.
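As far as I can tell from the Scipy documentation, this limited-memory aspect is exposed directly: L-BFGS-B keeps only a small number of correction pairs, set by the maxcor option (10 by default), whereas BFGS maintains the full dense approximation. A sketch of enlarging that history, reusing the same energy, energy_der and xyzInit:

from scipy.optimize import minimize

# L-BFGS-B stores only `maxcor` (step, gradient-change) correction pairs
# instead of a dense NxN inverse-Hessian approximation; here the history
# is enlarged from the default of 10 to 30 pairs.
res = minimize(energy, xyzInit, method='L-BFGS-B', jac=energy_der,
               options={'maxcor': 30, 'disp': True})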
My general hunch about the long convergence times is that the system is non-quadratic (i.e., flat) about the minimum, and thus the algorithm oscillates about without converging. Beyond that, if BFGS and L-BFGS truly work in the same manner, I believe there must be some difference between the convergence tolerances of the Scipy algorithms. Otherwise, BFGS and L-BFGS don't really work in the same manner, and the latter probably calculates the Hessian far more accurately.
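If it is a tolerance issue, the stopping criteria can at least be made comparable by hand: as I read the documentation, BFGS stops when the gradient norm falls below gtol, while L-BFGS-B stops on the relative decrease in the function value (ftol) or on the projected gradient (gtol), whichever is hit first. A sketch of tightening both, again with the same energy, energy_der and xyzInit:

from scipy.optimize import minimize

# BFGS terminates when the gradient norm drops below gtol.
res_bfgs = minimize(energy, xyzInit, method='BFGS', jac=energy_der,
                    options={'gtol': 1e-8})

# L-BFGS-B terminates on a relative decrease in f (ftol) or on the
# largest component of the projected gradient (gtol).
res_lbfgsb = minimize(energy, xyzInit, method='L-BFGS-B', jac=energy_der,
                      options={'ftol': 1e-12, 'gtol': 1e-8})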
References --
http://www.scipy-lectures.org/advanced/mathematical_optimization/#newton-and-quasi-newton-methods
https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization
https://en.wikipedia.org/wiki/Quasi-Newton_method
https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-bfgs.html#optimize-minimize-bfgs
Comments --

Is energy() small enough / self-contained enough to post, say on gist.github? (Scalable test cases are useful for testing other optimizers too.) – Irrelevance

That warning usually comes up either because the energy function is actually poorly behaved numerically or because the Jacobian function isn't actually correct. It's good to check the latter with scipy.optimize.check_grad or scipy.optimize.approx_fprime. In particular, I do convergence plots by varying the values of epsilon passed to those functions. – Schaerbeek

How would varying epsilon help with the final optimisation? It won't give you a better Jacobian function. – Sinless

You pass epsilon as one of the arguments to check_grad or approx_fprime; it's not an argument to minimize itself. What I'm suggesting is that you need to check that your Jacobian function is actually correct and numerically well behaved for the range of values that minimize will be testing, and you do that test with those first two functions. If you have the wrong Jacobian or if it's not well behaved numerically, feeding it in to the minimize function can result in this "precision loss" warning. – Schaerbeek