Why does optimx in R not give the correct solution to this simple nonparametric likelihood maximization?

Is optimx() providing incorrect solution or am I missing a simple point? Thank you!

I am trying to maximize a very simple likelihood. This is a non-parametric likelihood in the sense that the distribution of F is not specified parametrically. Rather, for each observed xi, f(xi)=pi and thus log(Likelihood)=Sum(log(f(xi)))=Sum(log(pi)).

The function I am trying to maximize is: sum(log(pi))+lamda(sum(pi-1)) where sum(pi)=1 (i.e. this is a constrained maximization problem which can be solved using Lagrange multiplier).

The answer to this problem is pi=1/n where n is the number of data points. However, optimx does not seem to give this solution. Does anybody have any idea. If n=2, the function I am maximizing is log(p1)+log(p2)+lamda(p1+p2-1).

Here is my code and output from R:

n=2
log.like=function(p)
{
  lamda=p[n+1]
  ll=0
  for(i in 1:n){
    temp = log(p[i])+lamda*p[i]-lamda/(n)
    ll=ll+temp
  }
  return(-ll)
}


mle = optimx(c(0.48,.52,-1.5),
             log.like,
             lower=c(rep(0.1,2),-3),
             upper=c(rep(.9,2),-1),
             method = "L-BFGS-B")

> mle
             par  fvalues   method fns grs itns conv  KKT1 KKT2 xtimes
1 0.9, 0.9, -1.0 1.010721 L-BFGS-B   8   8 NULL    0 FALSE   NA      0

The solution to the equation when n=2 is p1=p2=1/2 and lamda=-2. However, I do not get this when using optimx. Any idea?

Solution

Nothing wrong with optimx. Take a step back and look at the function you want to maximize: log(p1) + log(p2) + lamda*(p1+p2-1). It's quite intuitive that the optimal solution is to make all variables as large as possible, no? See that optimx rightfully returned the upper bounds you specified.

So what is wrong with your approach? When using Lagrange multipliers, critical points are saddle points of your function above, and not local minima like optimx would help you find. So you need to modify your problem in a such a way that these saddle points become local minima. This can be done by optimizing over the norm of the gradient, which is easy to compute analytically for your problem. There is a great example (with pictures) here:

http://en.wikipedia.org/wiki/Lagrange_multiplier#Example:_numerical_optimization.

For your problem:

grad.norm <- function(x) {
  lambda <- tail(x, 1)
  p <- head(x, -1)
  h2 <- sum((1/p + lambda)^2) + (sum(p) - 1)^2
}

optimx(c(.48, .52, -1.5),
       grad.norm,
       lower = c(rep(.1, 2), -3),
       upper = c(rep(.9, 2), -1),
       method = "L-BFGS-B")

#                               par      fvalues   method  fns grs [...]
# 1 0.5000161, 0.5000161, -1.9999356 1.038786e-09 L-BFGS-B  13  13 [...]

Follow up: If you do not want to or cannot compute the gradient yourself, you can let R compute a numerical approximation, for example:

log.like <- function(x) {
  lambda <- tail(x, 1)
  p <- head(x, -1)
  return(sum(log(p)) + lambda*(sum(p) - 1))
}

grad.norm <- function(x) {
  require(numDeriv)
  return(sum(grad(log.like, x)^2))
}

optimx(c(.48, .52, -1.5),
       grad.norm,
       lower = c(rep(.1, 2), -3),
       upper = c(rep(.9, 2), -1),
       method = "L-BFGS-B")

#                                par      fvalues   method fns grs [...]
# 1 0.5000161, 0.5000161, -1.9999356 1.038784e-09 L-BFGS-B  13  13 [...]