python, mathematical-optimization, scipy-optimize, scipy-optimize-minimize

Does the linesearch in scipy.optimize.minimize() with BFGS calculate gradients of the objective function for different values of the step size?


The BFGS algorithm calculates the gradient of the objective function, and then performs a linesearch along the search direction to decide on a step size. For the function I am attempting to minimize, the gradient calculation is extremely expensive (compared to a simple function evaluation), so it would be best to avoid extra computations of the gradient. During the linesearch phase of BFGS, are repeated calls made to the gradient (i.e. the jac callable provided to scipy.optimize.minimize())?

A 'no' answer would provide good motivation to take advantage of that fact in refactoring the code.

The scipy source code is a bit hard to step through, so any insight would be appreciated and would save me plenty of time.


Solution

  • During the linesearch phase of BFGS, are repeated calls made to the gradient?

    This question is slightly ambiguous, because there are two different things you could be asking.

    1. Are multiple calls made to the gradient during a single BFGS line search?

      Answer: Yes, it can use between one and 110 calls to the gradient during one line search.

    2. Are multiple calls made to the gradient during a single line search, with the same arguments?

      Answer: No.

    Are multiple calls made to the gradient during a single BFGS line search?

    It can call the gradient multiple times, but it doesn't always.

    BFGS calls either line_search_wolfe1 or line_search_wolfe2 to find the step size, and both of those call the gradient in a loop.

    line_search_wolfe1 uses MINPACK's DCSRCH function to determine the step size. In the process of doing this, it loops up to 100 times. Each time DCSRCH exits with task set to FG, line_search_wolfe1 evaluates the function and its gradient, so the gradient function can be called up to 100 times. (Note: it is planned to port DCSRCH to Python. If you are reading this in the future, this may no longer be implemented using MINPACK.)

    If line_search_wolfe1 fails to find a step size, then line_search_wolfe2 is called next.

    The line_search_wolfe2 function uses a different algorithm to determine the step size, and it can also call the derivative function. Its main loop runs for at most 10 iterations, so it can call the gradient up to 10 times.
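
    You can also see this from the public API: scipy.optimize.line_search is the wrapper around line_search_wolfe2, and it reports how many function and gradient evaluations it performed. A minimal sketch (the Rosenbrock objective and the steepest-descent direction are arbitrary choices for illustration):

    import numpy as np
    from scipy.optimize import line_search, rosen, rosen_der
    
    xk = np.array([-1.2, 1.0])   # current iterate
    pk = -rosen_der(xk)          # search direction (plain steepest descent here)
    
    # line_search wraps line_search_wolfe2; fc and gc are the numbers of
    # function and gradient evaluations the search performed.
    alpha, fc, gc, new_fval, old_fval, new_slope = line_search(rosen, rosen_der, xk, pk)
    print(f"step size = {alpha}, function evals = {fc}, gradient evals = {gc}")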

    Combining the two, BFGS can evaluate the gradient up to 110 times per line search. Usually it needs only one or two evaluations, but it is possible to hit 110.
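
    To get a feel for the typical case, the OptimizeResult already reports totals: nit is the number of BFGS iterations (one line search each) and njev is the total number of gradient evaluations, so njev / nit is a rough average of gradient calls per line search. A quick check, using SciPy's built-in Rosenbrock function as an arbitrary example:

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der
    
    res = minimize(rosen, np.array([-1.2, 1.0]), jac=rosen_der, method='BFGS')
    
    # Each BFGS iteration performs one line search, so njev / nit is roughly
    # the average number of gradient evaluations per line search.
    print(f"iterations = {res.nit}, gradient evals = {res.njev}, "
          f"average = {res.njev / res.nit:.2f}")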

    (The functions I read through to establish this are _line_search_wolfe12, line_search_wolfe1, line_search_wolfe2, and MINPACK's DCSRCH in the SciPy source.)

    Next, I investigated whether 110 gradient evaluations are actually possible. I tried optimizing a number of different functions, while measuring how many gradient evaluations were done per linesearch. SciPy's API doesn't expose this information, so I had to do it by hooking the SciPy functions responsible.

    import scipy.optimize
    import numpy as np
    
    # Hook SciPy's private _line_search_wolfe12 helper (the function BFGS
    # calls to perform its line search) so we can count gradient evaluations
    # per line search.
    old_line_search = scipy.optimize._optimize._line_search_wolfe12
    jac_count = 0
    def new_line_search(*args, **kwargs):
        global jac_count
        print("in line search")
        jac_count = 0
        ret = old_line_search(*args, **kwargs)
        print(f"leaving line search, did {jac_count} jacs")
        return ret
    scipy.optimize._optimize._line_search_wolfe12 = new_line_search
    
    options = {
        'gtol': 1e-8
    }
    
    # A contrived objective chosen to make the line search work hard.
    def func(x):
        return max(-(x[0])**2, -1e+120)
    
    # Gradient callback that counts how often SciPy actually calls it.
    def funcjac(x):
        global jac_count
        print(f"in jac, x={x}")
        jac_count += 1
        return scipy.optimize.approx_fprime(x, func)
    
    x0 = [0]
    result = scipy.optimize.minimize(func, x0, jac=funcjac, method='BFGS', options=options)
    print(result)

    If you run this, you'll see the following in the output:

    leaving line search, did 110 jacs
    

    So it is not just theoretical - it is possible to get it to use all 110 jacobian evaluations. Admittedly, this is a pretty contrived function to minimize.

    Are multiple calls made to the gradient during a single line search, with the same arguments?

    No. There is a class, ScalarFunction, which is responsible for memoizing (caching) calls to the function and its jacobian: it remembers the most recently evaluated x and the values computed there. Even if BFGS requested the jacobian at the same point more than once, ScalarFunction would return the cached result instead of calling your jac again.
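
    That cache only covers repeated requests at the same point, so with BFGS you normally don't need to do anything extra yourself. If you still want an explicit safety net around a very expensive gradient, you can add the same kind of last-value cache on your side. A minimal sketch (cached_jac is a hypothetical helper, not part of SciPy):

    import numpy as np
    import scipy.optimize
    
    def cached_jac(jac):
        """Wrap an expensive gradient so back-to-back calls at the same x
        reuse the previous result (similar in spirit to ScalarFunction's
        memoization)."""
        last_x, last_g = None, None
        def wrapper(x, *args):
            nonlocal last_x, last_g
            if last_x is None or not np.array_equal(x, last_x):
                last_x = np.array(x, copy=True)
                last_g = jac(x, *args)   # the expensive call
            return last_g
        return wrapper
    
    # Usage (expensive_jac is your real gradient function):
    # result = scipy.optimize.minimize(func, x0, jac=cached_jac(expensive_jac), method='BFGS')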