I am trying to maximize a very simple likelihood. This is a non-parametric likelihood
in the sense that the distribution of F is not specified parametrically. Rather,
for each observed xi
, f(xi)=pi
and thus log(Likelihood)=Sum(log(f(xi)))=Sum(log(pi))
.
The function I am trying to maximize is: sum(log(pi))+lamda(sum(pi-1))
where sum(pi)=1
(i.e. this is a constrained maximization problem which can be solved using Lagrange multiplier).
The answer to this problem is pi=1/n
where n
is the number of data points. However, optimx does not seem to give this solution. Does anybody have any idea. If n=2
, the function I am maximizing is log(p1)+log(p2)+lamda(p1+p2-1)
.
Here is my code and output from R:
n=2
log.like=function(p)
{
lamda=p[n+1]
ll=0
for(i in 1:n){
temp = log(p[i])+lamda*p[i]-lamda/(n)
ll=ll+temp
}
return(-ll)
}
mle = optimx(c(0.48,.52,-1.5),
log.like,
lower=c(rep(0.1,2),-3),
upper=c(rep(.9,2),-1),
method = "L-BFGS-B")
> mle
par fvalues method fns grs itns conv KKT1 KKT2 xtimes
1 0.9, 0.9, -1.0 1.010721 L-BFGS-B 8 8 NULL 0 FALSE NA 0
The solution to the equation when n=2
is p1=p2=1/2
and lamda=-2
. However, I do not get this when using optimx. Any idea?
Nothing wrong with optimx
. Take a step back and look at the function you want to maximize: log(p1) + log(p2) + lamda*(p1+p2-1)
. It's quite intuitive that the optimal solution is to make all variables as large as possible, no? See that optimx
rightfully returned the upper bounds you specified.
So what is wrong with your approach? When using Lagrange multipliers, critical points are saddle points of your function above, and not local minima like optimx
would help you find. So you need to modify your problem in a such a way that these saddle points become local minima. This can be done by optimizing over the norm of the gradient, which is easy to compute analytically for your problem. There is a great example (with pictures) here:
http://en.wikipedia.org/wiki/Lagrange_multiplier#Example:_numerical_optimization.
For your problem:
grad.norm <- function(x) {
lambda <- tail(x, 1)
p <- head(x, -1)
h2 <- sum((1/p + lambda)^2) + (sum(p) - 1)^2
}
optimx(c(.48, .52, -1.5),
grad.norm,
lower = c(rep(.1, 2), -3),
upper = c(rep(.9, 2), -1),
method = "L-BFGS-B")
# par fvalues method fns grs [...]
# 1 0.5000161, 0.5000161, -1.9999356 1.038786e-09 L-BFGS-B 13 13 [...]
Follow up: If you do not want to or cannot compute the gradient yourself, you can let R compute a numerical approximation, for example:
log.like <- function(x) {
lambda <- tail(x, 1)
p <- head(x, -1)
return(sum(log(p)) + lambda*(sum(p) - 1))
}
grad.norm <- function(x) {
require(numDeriv)
return(sum(grad(log.like, x)^2))
}
optimx(c(.48, .52, -1.5),
grad.norm,
lower = c(rep(.1, 2), -3),
upper = c(rep(.9, 2), -1),
method = "L-BFGS-B")
# par fvalues method fns grs [...]
# 1 0.5000161, 0.5000161, -1.9999356 1.038784e-09 L-BFGS-B 13 13 [...]