I am currently trying to implement a machine learning algorithm that involves the logistic loss function in MATLAB. Unfortunately, I am having some trouble due to numerical overflow.
In general, for a given input s, the value of the logistic loss function is:
log(1 + exp(s))
and the slope of the logistic loss function is:
exp(s)./(1 + exp(s)) = 1./(1 + exp(-s))
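For reference, here are these two quantities written as elementwise MATLAB anonymous functions (just shorthand for this question, not part of my algorithm):
loss  = @(s) log(1 + exp(s));      % value of the logistic loss for each element of s
slope = @(s) 1 ./ (1 + exp(-s));   % slope of the loss for each element of s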
In my algorithm, the value of s is X*beta. Here X is a matrix with N data points and P features per data point (i.e. size(X)=[N,P]) and beta is a vector of P coefficients, one for each feature, such that size(beta)=[P 1].
I am specifically interested in calculating the average value and gradient of the logistic loss for a given value of beta.
The average value of the logistic loss w.r.t. a value of beta is:
L = 1/N * sum(log(1+exp(X*beta)),1)
The average value of the slope of the logistic loss w.r.t. a value of beta is:
dL = 1/N * X' * (exp(X*beta)./(1+exp(X*beta)))
Note that size(dL) = [P 1].
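For concreteness, a direct vectorized implementation of these two expressions could look like this (with made-up example data just to show the shapes):
N = 1000; P = 10;                         % example sizes
X = randn(N, P); beta = randn(P, 1);      % example data and coefficients
s  = X*beta;                              % [N 1] vector of linear scores
L  = 1/N * sum(log(1+exp(s)), 1);         % average loss (scalar)
dL = 1/N * (X' * (exp(s)./(1+exp(s))));   % average gradient, size [P 1]
This direct version is exactly what runs into the overflow problem described next.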
My issue is that these expressions keep producing numerical overflows. The problem effectively comes from the fact that, in double precision, exp(s) overflows to Inf once s exceeds roughly 710 and exp(s) underflows to 0 once s drops below roughly -746.
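For example, at the MATLAB prompt:
exp(710)            % Inf  (overflow)
log(1 + exp(710))   % Inf, although the true value is essentially 710
exp(-746)           % 0    (underflow)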
I am looking for a solution such that s can take on any value in floating point arithmetic. Ideally, I would also really appreciate a solution that allows me to evaluate the value and gradient in a vectorized / efficient way.
How about the following approximations:
– For computing L, if s is large, then exp(s) will be much larger than 1:
1 + exp(s) ≅ exp(s)
and consequently
log(1 + exp(s)) ≅ log(exp(s)) = s.
If s is small, then using the Taylor series of exp()
exp(s) ≅ 1 + s
and using the Taylor series of log()
log(1 + exp(s)) ≅ log(2 + s) = log(2) + log(1 + s/2) ≅ log(2) + s / 2.
– For computing dL, for large s
exp(s) ./ (1 + exp(s)) ≅ 1
and for small s
exp(s) ./ (1 + exp(s)) ≅ 1/2 + s / 4.
(A code sketch for dL follows the code for L below.)
– The code to compute L could look for example like this:
s = X*beta;
l = log(1+exp(s));      % direct formula; becomes Inf where exp(s) overflows
ind = isinf(l);
l(ind) = s(ind);        % for large s, log(1+exp(s)) ≅ s
ind = (l == 0);         % here exp(s) underflowed relative to 1, i.e. s is very negative
l(ind) = exp(s(ind));   % there log(1+exp(s)) ≅ exp(s), which is essentially 0
L = 1/N * sum(l,1);
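– Similarly, a sketch of the corresponding code for dL (here only the large-s case needs a correction: where exp(s) overflows, exp(s)./(1+exp(s)) evaluates to Inf/Inf = NaN, while for very negative s the expression already returns the correct limit 0):
s = X*beta;
g = exp(s)./(1+exp(s));   % slope of the loss; NaN where exp(s) overflowed
ind = isnan(g);
g(ind) = 1;               % for large s the slope saturates at 1
dL = 1/N * (X'*g);        % size(dL) = [P 1]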