I am trying to find the solution of a linear model using optim. I could use lm, or even matrix multiplication to get the explicit solution, but the end goal is to use different objective functions. Since I can't replicate the lm solution, there must be something wrong with my code, and I cannot figure out what. Note that my linear model does not have an intercept term.
Here is an example:
RSS_f <- function(x){
  ## residual sum of squares for a no-intercept linear model;
  ## column 1 of dat is the response, the remaining columns are predictors
  f_hat <- as.matrix(dat[, 2:ncol(dat)]) %*% as.matrix(x)
  error <- dat[, 1] - f_hat
  return(sum(error^2))
}
x0 <- rep(0, 6)
r <- optim(x0, RSS_f)
Here is the sample data I have been using:
dat=structure(list(y = c(-0.015592301115793, -0.257703905025368,
0.132944891819098, 0.144585602249198, -0.0834638586276937, -0.132506252754974,
-0.0671452738850035, 0.017093324992248), x1 = c(0.0762815118005244,
0.0684222303897641, 0.0126545677701762, -0.0283951453807663,
0.0682085328963522, 0.030322515917991, -0.0288021662870674, 0.0213481986485853
), x2 = c(0.121096462766523, 0.173507656658378, 0.0830579904733009,
-0.0131531465094037, 0.14514027572694, -0.00235172620017821,
-0.116782685211159, 0.128253294500769), x3 = c(0.00271738492481077,
0.199849189671841, -0.149623094678353, -0.0375499460631867, 0.035188928292329,
0.0421848373709954, -0.212940391685557, -0.280765720194465),
x4 = c(-0.661111149054194, -1.62711136938006, -0.16384009438204,
1.24217835189415, -1.01181978837279, -0.968527314213574,
-0.271489340669151, -0.602168687364268), x5 = c(1.6840335520254,
0.0085051611667053, 0.803743907332177, -1.86248825400901,
1.67581124074128, -0.0919360291035565, -0.210122962144954,
1.96152949931911), x6 = c(0.251747455313378, -0.353511278663921,
-0.155085549001921, -0.376159415130184, -0.0614334625077317,
0.759475827597367, 0.10074325959879, 0.512321974431873)), row.names = 63:56, class = "data.frame")
I see one obvious problem here, and encountered several non-obvious problems as I moved forward.
Point (1), the obvious one: lm() fits an intercept by default, while your objective function does not include one, so the two are solving different problems. Here is a slightly reorganized objective function (the only important difference is cbind(1, ...) to add an intercept column to the model matrix, addressing point (1)):
RSS_f <- function(beta, dat){
  ## inefficient: X and y should be computed once, outside this
  ## function, since optim() re-evaluates it many times ... but
  X <- cbind(1, as.matrix(dat[,-1]))  ## prepend the intercept column
  y <- dat[,1]
  f_hat <- X %*% beta
  return(sum((y - f_hat)^2))
}
x0 <- rep(0,7) ## need one additional element in the vector for the intercept
r <- optim(x0, RSS_f, dat=dat, control = list(maxit = 1e5))
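With the intercept in place you can compare the optim() solution against lm() directly (a quick check; as discussed below, the match still won't be exact):
r$par                        ## optim() estimates (intercept first)
coef(lm(y ~ ., data = dat))  ## lm() reference fit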
Points (2) and (3) below are related to @ZheyuanLi's answer (i.e., picking a good starting point will mitigate them).
Point (2): if we run this optim() call without control = list(maxit = 1e5), it fails to converge, but silently. You have to look at r$convergence, see that the value is 1 and not 0 (zero means "converged"), and then go look at ?optim to see that r$convergence == 1 means the iteration limit maxit was reached.
Point (3): if you look at the coefficients, you'll see that they still don't quite match lm. This data set has 8 observations and 7 coefficients to estimate (the intercept plus 6 predictors), which makes the OLS fit pretty ambitious (although not impossible). I tried a few other options (e.g. method = "BFGS" in optim()), which didn't help much; see the sketch below.
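For example, one of the variants I tried (BFGS falls back on a numerical gradient here, since no gradient function is supplied):
r_bfgs <- optim(x0, RSS_f, dat = dat, method = "BFGS",
                control = list(maxit = 1e5))
r_bfgs$par  ## still not a great match to coef(lm(y ~ ., data = dat))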
You can get an idea of the stability of the OLS problem by looking at the ratio of the largest to the smallest eigenvalue of crossprod(X) (i.e., t(X) %*% X); in this case the ratio is about 2e5 (bad news).
X <- cbind(1, as.matrix(dat[,-1]))
eigen(crossprod(X))$values
[1] 2.034880e+01 6.082053e+00 2.730513e+00 9.748323e-01 8.148341e-02
[6] 1.358231e-02 9.186567e-05
Alternatively, compute the condition number directly: Matrix::condest(crossprod(X))$est gives about 2.75e5 (again, a large value is bad).
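(Base R's kappa() is an alternative if you don't want the Matrix package; with exact = TRUE it computes the 2-norm condition number, which for a symmetric matrix like this one is exactly the eigenvalue ratio above.)
kappa(crossprod(X), exact = TRUE)  ## ~2e5, consistent with condest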
For comparison, if I generate the predictors randomly (m <- matrix(rnorm(48), nrow = 8)), add an intercept column, and compute the condition number of crossprod(m), the values are typically on the order of 500 (1000 random replicates ranged from about 20 to 30,000); a sketch of that simulation follows. So part of the problem is simply 'not much data for a model with this many predictors', but your example is even worse than a randomly sampled one.
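Here is the simulation sketch (the seed and the choice of summary are mine, not essential to the point):
set.seed(101)  ## arbitrary seed (assumption)
cond_num <- function() {
  m <- cbind(1, matrix(rnorm(48), nrow = 8))  ## 6 random predictors + intercept
  ev <- eigen(crossprod(m))$values
  max(ev) / min(ev)  ## eigenvalue ratio = condition number of crossprod(m)
}
vals <- replicate(1000, cond_num())
summary(vals)  ## typical values on the order of 500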
R's built-in method for fitting linear models (QR with a custom pivoting strategy) works robustly even on such ill-posed problems.
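For intuition, a sketch of the QR route (lm() itself goes through lm.fit(), which calls a modified LINPACK routine with column pivoting, so this is only an approximation of what it does):
qr.coef(qr(X), y)  ## X, y as defined above; matches coef(lm(y ~ ., data = dat))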
I don't think removing the intercept will change the ill-conditioning problem very much, although I haven't checked.
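If you want to check, the same eigenvalue ratio diagnostic applies to the no-intercept model matrix:
X_ni <- as.matrix(dat[, -1])  ## model matrix without the intercept column
ev <- eigen(crossprod(X_ni))$values
max(ev) / min(ev)  ## compare with the ~2e5 ratio above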