Tags: python, scipy, scipy-optimize-minimize

SciPy Minimization Convergence Problems for Objective Function with Small Values and Numerical Derivatives


I’m having issues with minimization in SciPy for an objective function that returns very small values, in a setting where I would like to use numerical derivatives with a gradient-based algorithm. It is for a problem about optimal sample sizes from Cochran’s 1977 “Sampling Techniques” textbook. My minimum working example is as follows:


import numpy as np
from scipy.optimize import minimize

#parameters
n_strata=4
strata_size = [0.25, 0.25, 0.25, 0.25]
strata_variance = [0.25, 0.25, 0.25, 0.25]
total_sample = 8000

#parameter to control the scale of the objective function
scale_value = 1.0


#objective function that computes the variance of a stratified estimator
def objective_function(n_vec_in_short):

    #convert from numpy to a list
    if isinstance(n_vec_in_short,np.ndarray):
        n_vec_in_short_list= n_vec_in_short.tolist()
    else:
        n_vec_in_short_list = n_vec_in_short
    
    #get the sample size of the last stratum 
    n_last_stratum = total_sample - sum( n_vec_in_short_list)
    
    #now create the full vector of length n_strata (without mutating the caller's list)
    n_vec = n_vec_in_short_list + [n_last_stratum]
   
    variance_total=0.0

    #get the sum of variance across strata
    for h in range(0,n_strata):
        variance_total = variance_total + (strata_size[h]**2)*strata_variance[h]/n_vec[h]

    return variance_total*scale_value

solution = minimize(objective_function, [2500, 2500, 2500], method='BFGS')
print(solution)


true_optimum = objective_function([2000,2000,2000])
print(true_optimum)


My issue is that if I leave scale_value at 1.0, minimize returns only the initial guess of [2500, 2500, 2500]. However, if I change scale_value to 100000.0, I do get the optimal solution.

I know how to fix this for the minimum working example. But for the actual problem I’m working on, I would appreciate help understanding SciPy better: in particular, how to set the parameters of algorithms that use numerical derivatives so as to avoid convergence issues like this.

I have tried ad-hoc fixes that worked, but I would like the community's help in understanding minimize better so I can arrive at a less ad-hoc solution.


Solution

  • BFGS has more than one criterion for terminating an optimization. One of them fires when the gradient is too small: specifically, if the Euclidean norm of the gradient is smaller than the solver option gtol, the solver exits. By default gtol is 1e-5.

    This is why scaling the output of the function changes the behaviour: if the value of the function is multiplied by 100000, then the gradient is also multiplied by 100000, lifting its norm back above gtol.
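The gradient at the starting point can be checked directly with SciPy's forward-difference helper. This is a sketch, not the answerer's code: the objective is re-implemented inline (the constant 0.25**2 * 0.25 is the common strata_size[h]**2 * strata_variance[h] factor), and the step size is an assumed typical forward-difference step.

```python
import numpy as np
from scipy.optimize import approx_fprime

total_sample = 8000
w2s2 = 0.25**2 * 0.25  # strata_size[h]**2 * strata_variance[h], identical across strata

def objective(n_short):
    # variance of the stratified estimator; the last stratum's size is implied
    n_last = total_sample - np.sum(n_short)
    return w2s2 * (np.sum(1.0 / np.asarray(n_short, dtype=float)) + 1.0 / n_last)

x0 = np.array([2500.0, 2500.0, 2500.0])
eps = np.sqrt(np.finfo(float).eps)  # typical forward-difference step
grad = approx_fprime(x0, objective, eps)
print(np.linalg.norm(grad))  # on the order of 1e-7, below the default gtol of 1e-5
```

So at the very first iteration the test |g| < gtol already passes, and BFGS returns the initial guess.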

    But why is the gradient so small? To explain this, I varied total_sample from 1000 to 80000, capturing the gradient via numeric differentiation at the point which was 20% away from the optimum. (This is the same distance as your example of starting at 2500 when the optimum is 2000.)
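That sweep can be reproduced with a short script. A sketch (my re-implementation, not the original): it scales the question's starting point of 2500 (for total_sample = 8000) proportionally with the problem size and measures the forward-difference gradient norm there.

```python
import numpy as np
from scipy.optimize import approx_fprime

def gradient_norm(total_sample):
    w2s2 = 0.25**2 * 0.25  # strata_size[h]**2 * strata_variance[h]

    def objective(n_short):
        n_last = total_sample - np.sum(n_short)
        return w2s2 * (np.sum(1.0 / np.asarray(n_short, dtype=float)) + 1.0 / n_last)

    # scale the question's start of 2500 (out of 8000) with the problem size
    x0 = np.full(3, 0.3125 * total_sample)
    eps = np.sqrt(np.finfo(float).eps)
    return np.linalg.norm(approx_fprime(x0, objective, eps))

for n in (1000, 8000, 80000):
    print(n, gradient_norm(n))
```

The printed norms fall off roughly as total_sample**-2: an 80-fold increase in problem size shrinks the gradient norm by a factor of about 6400.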

    [figure: gradient size vs problem size]

    Some things to notice about this graph: the gradient norm falls off roughly as total_sample**-2, so the larger the problem, the sooner the default gtol test fires at points that are still well away from the optimum.

    One simple thing you could do to fix this is to make gtol scale with total_sample in the same way.

    solution = minimize(objective_function, [2500, 2500, 2500], method='BFGS', options=dict(gtol=1e-5 * total_sample**-2))
    

    Alternatively, you could completely disable termination due to gtol.

    solution = minimize(objective_function, [2500, 2500, 2500], method='BFGS', options=dict(gtol=0))
    

    This will still terminate: BFGS can also stop when the relative step size becomes negligible (the xrtol option, in recent SciPy versions) or when it exceeds the maximum number of iterations.
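As a quick sanity check, the gtol=0 run can be tried against a self-contained copy of the objective (re-implemented here, with 0.25**2 * 0.25 as the common per-stratum factor). I make no claim about exactly where it stops, but the final function value can be no worse than the starting one.

```python
import numpy as np
from scipy.optimize import minimize

total_sample = 8000
w2s2 = 0.25**2 * 0.25  # strata_size[h]**2 * strata_variance[h]

def objective(n_short):
    n_last = total_sample - np.sum(n_short)
    return w2s2 * (np.sum(1.0 / np.asarray(n_short, dtype=float)) + 1.0 / n_last)

x0 = np.array([2500.0, 2500.0, 2500.0])
solution = minimize(objective, x0, method='BFGS', options=dict(gtol=0))
print(solution.x, solution.fun, solution.nit)
```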

    Finally, I will note that, strictly speaking, the solution you call optimal is not the global optimum of the unconstrained problem: [-2000, -2000, -2000] scores better. You may want to add bounds to prevent negative sample sizes, plus a constraint to keep n_last_stratum from going negative.