[matlab] [floating-point] [numerical-methods] [logistic-regression] [numerical-stability]

Avoiding numerical overflow when calculating the value AND gradient of the Logistic loss function


I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.

In general, for a given input s, the value of the logistic loss function is:

 log(1 + exp(s))

and the slope of the logistic loss function is:

 exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))

In my algorithm, the value of s = X*beta. Here X is a matrix with N data points and P features per data point (i.e. size(X) = [N P]), and beta is a vector of P coefficients, one per feature, such that size(beta) = [P 1].
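For concreteness, a minimal setup with illustrative sizes (the variable names match those above; the values are arbitrary):

    N = 5; P = 3;            % illustrative sizes
    X = randn(N, P);         % N data points, P features each
    beta = randn(P, 1);      % P coefficients
    s = X*beta;              % size(s) = [N 1]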

I am specifically interested in calculating the average value and gradient of the logistic loss for a given value of beta.

The average value of the logistic loss w.r.t. a value of beta is:

 L = 1/N * sum(log(1+exp(X*beta)),1)

The average value of the slope of the logistic loss w.r.t. a value of beta is:

 dL = 1/N * (X' * (exp(X*beta)./(1+exp(X*beta))))

Note that size(dL) = [P 1].

My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that, in double precision, exp(s) overflows to Inf when s exceeds roughly 709, and underflows to 0 when s is below roughly -745.
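A minimal MATLAB demonstration of the failure mode:

    log(1 + exp(800))             % returns Inf  (exp(800) overflows to Inf)
    exp(800) ./ (1 + exp(800))    % returns NaN  (Inf/Inf)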

I am looking for a solution such that s can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.


Solution

  • How about the following approximations:

    – For computing L, if s is large, then exp(s) will be much larger than 1:

    1 + exp(s) ≅ exp(s)
    

    and consequently

    log(1 + exp(s)) ≅ log(exp(s)) = s.
    

    If s is very negative, then exp(s) is tiny, and since log(1 + x) ≅ x for small x,

    log(1 + exp(s)) ≅ exp(s) ≅ 0.
    

    – For computing dL, for large s

    exp(s) ./ (1 + exp(s)) ≅ 1
    

    and for very negative s

    exp(s) ./ (1 + exp(s)) ≅ exp(s) ≅ 0.
    
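    Putting the dL approximations together: the equivalent form 1./(1+exp(-s)) from the question is already safe at both extremes, because exp(-s) underflows to 0 for large s (giving 1) and overflows to Inf for very negative s (and 1./Inf evaluates to 0). A sketch using the question's variables:

    s = X*beta;
    sigma = 1 ./ (1 + exp(-s));   % stays in [0,1]; no Inf/Inf, unlike exp(s)./(1+exp(s))
    dL = 1/N * (X' * sigma);      % size(dL) = [P 1]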

    – The code to compute L could look for example like this:

    s = X*beta;
    l = log(1+exp(s));        % naive evaluation
    ind = isinf(l);           % overflow: exp(s) = Inf, so use log(1+exp(s)) ≅ s
    l(ind) = s(ind);
    ind = (l == 0);           % underflow: 1+exp(s) rounded to 1, so use log(1+x) ≅ x
    l(ind) = exp(s(ind));
    L = 1/N * sum(l,1);
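    As a branch-free alternative (a common trick, not part of the approximations above), one can use the identity log(1+exp(s)) = max(s,0) + log(1+exp(-abs(s))), in which the argument of exp is never positive:

    s = X*beta;
    l = max(s, 0) + log(1 + exp(-abs(s)));   % exp argument <= 0, so it never overflows
    L = 1/N * sum(l, 1);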