The BFGS algorithm calculates the gradient of the objective function, and then performs a line search along the current search direction (the quasi-Newton descent direction, not the raw gradient) to decide on a step size.
For the function I am attempting to minimize, the gradient calculation is extremely expensive compared to a simple function evaluation, so it would be best to avoid extra computations of the gradient. During the line search phase of BFGS, are repeated calls made to the gradient (i.e. the jac callable provided to scipy.optimize.minimize())?
A 'no' answer would provide good motivation to take advantage of that fact in refactoring the code.
The scipy source code is a bit hard to step through, so any insight would be appreciated, and save me plenty of time.
During the linesearch phase of BFGS, are repeated calls made to the gradient?
This is a slightly ambiguous question, because there are two slightly different questions you could be asking.
Are multiple calls made to the gradient during a single BFGS line search?
Answer: Yes, it can use between one and 110 calls to the gradient during one line search.
Are multiple calls made to the gradient during a single line search, with the same arguments?
Answer: No.
It can call the gradient multiple times, but it doesn't always.
BFGS calls either line_search_wolfe1 or line_search_wolfe2 to find a step size, and both of those call the gradient in a loop.
line_search_wolfe1 uses MINPACK's DCSRCH routine to determine the step size. In the process of doing this, it loops up to 100 times. Whenever DCSRCH exits with task set to FG, line_search_wolfe1 evaluates the function and its gradient. It can therefore call the gradient function up to 100 times. (Note: there are plans to port DCSRCH to Python, so if you are reading this in the future, it may no longer be implemented using MINPACK.)
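The shape of that reverse-communication loop can be sketched as follows. This is a simplified, runnable caricature: dcsrch_like is a crude backtracking stand-in I wrote for illustration, not the real MINPACK routine, but it shows how each FG round trip costs one gradient evaluation.

```python
def dcsrch_like(alpha, f_val, g_val, state):
    # Stand-in stepper: crude Armijo-style backtracking instead of the
    # real DCSRCH logic. Returns (new_alpha, task); task 'FG' means
    # "give me fresh function and gradient values at new_alpha".
    if f_val <= state['f0'] + 1e-4 * alpha * state['g0']:
        return alpha, 'CONV'
    return alpha * 0.5, 'FG'

def line_search_wolfe1_sketch(phi, derphi, maxiter=100):
    state = {'f0': phi(0.0), 'g0': derphi(0.0)}
    alpha, task = 1.0, 'FG'
    n_grad = 0
    for _ in range(maxiter):               # up to 100 iterations
        if task == 'FG':                   # the stepper wants new values...
            f_val = phi(alpha)             # ...one function evaluation
            g_val = derphi(alpha)          # ...and one gradient evaluation
            n_grad += 1
        alpha, task = dcsrch_like(alpha, f_val, g_val, state)
        if task != 'FG':
            break
    return alpha, n_grad

# 1-D example: phi(a) = (a - 0.1)**2, derphi(a) = 2*(a - 0.1)
alpha, n_grad = line_search_wolfe1_sketch(
    lambda a: (a - 0.1)**2, lambda a: 2 * (a - 0.1))
print(alpha, n_grad)   # 0.125 4
```

Several gradient evaluations happen inside this one "line search", which is the pattern that lets the real loop use up to 100.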
If line_search_wolfe1 fails to find a step size, then line_search_wolfe2 is called next.
The line_search_wolfe2 function uses a different algorithm to determine the step size, and it too can call the derivative function. Its loop runs for at most 10 iterations, so it can call the gradient up to 10 times.
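You can watch multiple gradient calls happen in a single line search with the public scipy.optimize.line_search wrapper (which is backed by line_search_wolfe2); the counting wrapper here is my own addition, not part of SciPy:

```python
import numpy as np
import scipy.optimize

grad_calls = 0

def f(x):
    return float(x[0]**2 + x[1]**2)

def grad(x):
    # Count every time SciPy asks for the gradient.
    global grad_calls
    grad_calls += 1
    return 2 * x

xk = np.array([10.0, 10.0])
pk = -grad(xk)      # search direction (one gradient call here)
grad_calls = 0      # count only calls made inside the line search

alpha, fc, gc, *_ = scipy.optimize.line_search(f, grad, xk, pk)
print(f"alpha={alpha}, gradient calls during line search={grad_calls}")
```

Even for this easy quadratic the line search needs at least one gradient evaluation of its own; on nastier functions the count climbs toward the iteration cap.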
Combining the two, BFGS can evaluate the gradient up to 110 times per line search. Usually it needs only one or two evaluations, but 110 is possible.
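On a well-behaved function you can confirm the usual low count from the result object's njev field, which reports the total number of gradient evaluations; this quadratic example is my own:

```python
import numpy as np
import scipy.optimize

def f(x):
    return float((x[0] - 3.0)**2 + (x[1] + 1.0)**2)

def jac(x):
    # Exact gradient of f.
    return np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)])

res = scipy.optimize.minimize(f, [0.0, 0.0], jac=jac, method='BFGS')
# njev counts gradient evaluations across the whole run; for an easy
# quadratic it stays close to the iteration count, nowhere near 110.
print(f"iterations={res.nit}, gradient evaluations={res.njev}")
```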
Here are the functions I investigated to find this:
Next, I investigated whether 110 gradient evaluations are actually possible. I tried optimizing a number of different functions while measuring how many gradient evaluations were done per line search. SciPy's API doesn't expose this information, so I had to do it by hooking the responsible SciPy functions.
import scipy.optimize
import numpy as np
import sys
old_line_search = scipy.optimize._optimize._line_search_wolfe12
jac_count = 0
def new_line_search(*args, **kwargs):
    global jac_count
    print("in line search")
    jac_count = 0
    ret = old_line_search(*args, **kwargs)
    print(f"leaving line search, did {jac_count} jacs")
    return ret

scipy.optimize._optimize._line_search_wolfe12 = new_line_search

options = {
    'gtol': 1e-8
}

def func(x):
    return max(-(x[0])**2, -1e+120)

def funcjac(x):
    global jac_count
    print(f"in jac, x={x}")
    jac_count += 1
    return scipy.optimize.approx_fprime(x, func)
x0 = [0]
result = scipy.optimize.minimize(func, x0, jac=funcjac, method='BFGS', options=options)
print(result)
If you run this, in the output you'll see this:
leaving line search, did 110 jacs
So it is not just theoretical - it is possible to get it to use all 110 jacobian evaluations. Admittedly, this is a pretty contrived function to minimize.
As for the second question (repeated calls with the same arguments): no. There is a class, ScalarFunction, which is responsible for memoizing (caching) calls to the function and its Jacobian. Even if BFGS requested the gradient at the same point multiple times, ScalarFunction would serve the repeats from its cache.
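The idea can be sketched like this; MemoizedJac below is my own simplified illustration of last-point caching, not SciPy's actual ScalarFunction code:

```python
import numpy as np

class MemoizedJac:
    """Cache the most recent gradient evaluation, so a repeated call
    with the same x does not re-run the expensive jac (sketch only)."""
    def __init__(self, jac):
        self.jac = jac
        self.x = None      # last argument seen
        self.value = None  # cached gradient at self.x
        self.ncalls = 0    # number of real gradient evaluations

    def __call__(self, x):
        x = np.asarray(x)
        if self.x is None or not np.array_equal(x, self.x):
            self.x = x.copy()
            self.value = self.jac(x)
            self.ncalls += 1
        return self.value

# Repeated calls with the same x hit the cache:
mj = MemoizedJac(lambda x: 2 * x)
mj(np.array([1.0, 2.0]))
mj(np.array([1.0, 2.0]))  # cached, no new evaluation
print(mj.ncalls)  # 1
```

Note this only caches the latest point, which is all an optimizer like BFGS needs, since it queries the same point back-to-back rather than revisiting old ones.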