cprecisionnumerical-methodsgradient-descent

Numerical instability of gradient descent in C


I wrote a simple gradient descent algorithm with the steepest descent method.

For steepes descent i mean that the step size is optimized to be the one that minimizes f(x - lambda*grad(f)) with lambda the step size with the result that each direction is orthogonal to the preceding.

The problem is that the program seems very unstable: it works (kinda) well only with quadratic functions but with even quartic functions like x^4 + y^4 + z^4 is unstable no matter the precision used

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double l2_norm(double *x1, double *x2, int m);
void get_grad(double (*f)(double *x, int m), double *x, double *grad, int m);
void copy(double *a, double *b, int m);
void add(double *a, double *b, double lambda, int m);
double dFdLambda(double (*f)(double *x, int m), double *x, double *u, int m);
void descent(double (*f)(double *x, int m), double *x, double *u, int m);

void print_vec(double *x, int m){
    for(int i=0; i<m; i++){
        printf("%f\n", x[i]);
    }
    return;
}

double f_temp(double *x, int m){

    double f=0.;

    for(int i=0; i<m; i++){
        f += (x[i]-1)*(x[i]-1);
    }
    return f;
}

int main(){

    double x[] = {2., 2., 2.};
    double *grad;
    double *x_old;
    int m = 3;
    double e = 1.E-6;

    grad = malloc(m*sizeof(double));
    x_old = malloc(m*sizeof(double));
    
    while(l2_norm(x, x_old, m) > e){
        copy(x, x_old, m);
        get_grad(f_temp, x, grad, m);
        descent(f_temp, x, grad, m);
        print_vec(x, m);
    }

    printf("\n");
    print_vec(x, m);
    

    return 0;
}


double l2_norm(double *x1, double *x2, int m){

    double norm = 0;

    for(int i=0;i<m;i++){
        norm += pow(x1[i]-x2[i], 2);
    }

    norm = sqrt(norm);

    return norm;
}


void get_grad(double (*f)(double *x, int m), double *x, double *grad, int m){
    /*
    numerical gradient with simmetric method

    the actual gradient computed is -grad in order to be used in 
    gradient descent
    */
    double e = 1.E-6; // numerical precision
    double *x_forward;
    double *x_backward;
    x_forward = malloc(m*sizeof(double));
    x_backward = malloc(m*sizeof(double));
    
    for(int i=0;i<m; i++){
        x_forward[i] = x[i] + e;
        x_backward[i] = x[i] - e;    
        grad[i] = -(f(x_forward, m) - f(x_backward, m))/(2*e);
        x_forward[i] -= e;
        x_backward[i] += e;
    }
    free(x_forward);
    free(x_backward);
    return;
}


double dFdLambda(double (*f)(double *x, int m), double *x, double *u, int m){
    /*
    compute the total derivative dF/dLambda in the point x along
    the direction u
    */
    double e = 1.E-5;
    double *x_forward;
    double *x_backward;
    double der;

    x_forward = malloc(m*sizeof(double));
    x_backward = malloc(m*sizeof(double));
    copy(x, x_forward, m);
    copy(x, x_backward, m);
    add(x_forward, u, e, m);
    add(x_backward, u, -e, m);

    der = (f(x_forward, m) - f(x_backward, m))/(2*e);

    free(x_forward);
    free(x_backward);

    return der;
}


void copy(double *a, double *b, int m){
    //copies a into b
    for(int i=0; i<m; i++){
        b[i] = a[i];
    }
    return;
}


void add(double *a, double *b, double lambda, int m){
    /*
    adds lambda*b to a
    */
    for(int i=0; i<m; i++){
        a[i] += lambda*b[i];
    }
    return;
}


void descent(double (*f)(double *x, int m), double *x, double *u, int m){
    /*
    actual gradient descent starting from x going in direction u
    */
    double e = 1.E-5;
    double derA, derB, derC;
    double *x_start;
    double lambda, lambda_min, lambda_max;

    lambda = e;

    x_start = malloc(m*sizeof(double));
    copy(x, x_start, m);
    derA = dFdLambda(f, x, u, m);
    derC = derA;

    /*
    this while loop finds the interval in which the total derivatives df/dl
    changes sign i.e. the interval in which the solution of df/dl = 0
    will be searched with bisection method
    */
    while((derA*derC) >= 0.){
        copy(x_start, x, m);
        add(x, u, lambda, m);
        derC = dFdLambda(f, x, u, m);
        lambda *= 2.;
    }

    /*
    because lambda >= 0 the leftmost point of the interval is 0, the rightmost 
    the point found before after which the total derivatives changes sign
    */
    lambda_min = 0;
    lambda_max = lambda;
    lambda = 0.5*(lambda_min + lambda_max);

    while((fabs(lambda_max - lambda_min)) > e){
        copy(x_start, x, m);
        add(x, u, lambda_min, m);
        derA = dFdLambda(f, x, u, m);

        copy(x_start, x, m);
        add(x, u, lambda, m);
        derB = dFdLambda(f, x, u, m);

        if((derA*derB) > 0.){
            lambda_min = lambda;
        } 
        else if((derA*derB) < 0.){
            lambda_max = lambda;
        }
        else{
            lambda_min = lambda_max = lambda;
        }

        lambda = 0.5*(lambda_min + lambda_max);
    }

    copy(x_start, x, m);
    // the modified vector x will cointain the coordinates of the minumum
    add(x, u, lambda, m); 
    free(x_start);

    return;
}

with f += (x[i]-1)*(x[i]-1) the output is correct

1.000005
1.000005
1.000005
-1.499995
-1.499995
1.000005
1.000012
1.000012
1.000000
1.000000
1.000000
1.000000
1.000000
1.000000
1.000000

1.000000
1.000000
1.000000

but using (x[i]-1)*(x[i]-1)*(x[i]-1)*(x[i]-1) or even the pow() function with power 2 the results are similar to

1.000010
1.000010
1.000010
-1.499990
-1.499990
1.000010
1.000292
1.000292
1.000010
1.000292
1.000000
1.000010
1.000292
-1032674302637538115769421495899732814941072310657452010951926707678661836800.000000
-171980112174886070524247400874370796081962558827236065865505533642887865040896.000000
inf
inf
nan

inf
inf
nan

Lowering the numerical precision inside the functions (the double e variables) seems to work only for simple quadratic functions (without using the pow() function), otherwise i get null or inf numbers


Solution

  • You have some issues with using allocated objects while their values are indeterminate. The first one is in main:

        x_old = malloc(m*sizeof(double));
        
        while(l2_norm(x, x_old, m) > e){
    

    Note that you are using the data to which x_old points without ever having assigned any values to it. Since you always want to do at least one iteration, it would make sense to restructure the while loop as a do ... while.

    But you have some more in get_grad() that are probably a lot more impactful:

        x_forward = malloc(m*sizeof(double));
        x_backward = malloc(m*sizeof(double));
        
        for(int i=0;i<m; i++){
            x_forward[i] = x[i] + e;
            x_backward[i] = x[i] - e;    
            grad[i] = -(f(x_forward, m) - f(x_backward, m))/(2*e);
            x_forward[i] -= e;
            x_backward[i] += e;
        }
    

    Note that you start using the data pointed-to by x_forward and x_backward right after allocation, when those values are undefined. You eventually assign values to all the allocated elements, but you initially assign just element 0 of each. Presumably, you want to copy the data from x into the spaces to which these point prior to starting the loop. As presently written, without such copying, the resulting behavior is undefined. There is absolutely no reason to expect that the resulting data stored in *grad will be an estimate of the gradient at x.

    Even if you could rely on malloc() initializing the allocated space to all-zero (which it does not do, but calloc() does), the program would be wrong. In particular, the get_grad() function would, in general, compute each component of the supposed gradient at a different point.

    I was able to reproduce your program's erroneous output fairly closely. The computation took a similar path, diverging to NaN after not too many iterations.

    After fixing those two issues, the program produces this output instead:

    1.000005
    1.000005
    1.000005
    1.000000
    1.000000
    1.000000
    1.000000
    1.000000
    1.000000
    
    1.000000
    1.000000
    1.000000
    

    Update

    The program also has an implementation error in function dFdLambda(). It computes the derivative of f in the chosen direction as

        der = (f(x_forward, m) - f(x_backward, m))/(2*e);
    

    , but the divisor in that formula is incorrect, grossly so in some cases. It should be the distance (L2 norm) between x_forward and x_backward. When I make that correction, the resulting program converges in the quartic case, too.*


    I imagine that you can probably still find input functions where the program misbehaves. In particular, you may have trouble with functions that have even steeper gradients in the vicinity of the local minimum, and with those that oscillate very rapidly. You may also have trouble with functions that are distinctly anisotropic in the various dimensions of the domain, and with functions that have local minimums with very large values.

    Addressing the issue that @chux describes would help with some of that. A more adaptive approach to estimating the gradient and derivative might help with some others.

    And the descent computation itself could be protected against divergence. For example, it could verify that the value of f in fact does decrease, and / or the change in x per descent step could be damped or capped in various ways.

    Also, I understand mathematically why descent() estimates derivatives of the function and uses them to search for a minimum along the chosen direction, by computationally, you would probably be better off searching for a minimum directly, using the values of the function instead of estimates of its derivative. That would certainly reduce the likelihood of divergences such as you observed, and its convergence properties in general are likely to be as good or better.

    Ultimately, however, numerical programs are constrained by the numeric representations they use, which inevitably have limited range and precision (though you can choose the specific limits more or less arbitrarily if you want to do). This is a fundamental constraint of numerical programming.


    *Though I'm not actually sure why it makes a difference, because the program seems to depend only on the sign of the derivative, not its magnitude.