I have used Spark 2.4 and earlier versions, where spark.mllib relies on SGD as the optimizer for regression problems, for example `LinearRegressionWithSGD` and `LassoWithSGD`.
It looks to me that Spark now uses L-BFGS, a normal equation solver for weighted least squares, and iteratively reweighted least squares (IRLS) as optimizers (https://spark.apache.org/docs/latest/ml-advanced.html).
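For context, this is roughly how the optimizer choice surfaces in the DataFrame-based spark.ml API as far as I can tell: a `solver` parameter that accepts `"l-bfgs"`, `"normal"`, or `"auto"`, with no SGD option. The column names and toy data below are just placeholders:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lr-solver-demo").master("local[*]").getOrCreate()

// Toy data; in practice you would load a real DataFrame with "features"/"label" columns.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (2.0, Vectors.dense(1.0, 2.1)),
  (3.0, Vectors.dense(2.0, 3.0))
)).toDF("label", "features")

// solver can be "l-bfgs", "normal" (normal equation / weighted least squares) or "auto".
// There is no SGD option here, which is what prompted my question.
val lr = new LinearRegression()
  .setSolver("l-bfgs")
  .setMaxIter(100)
  .setRegParam(0.1)

val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```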
My question is why Spark gave up SGD and what the motivation was for moving to the optimizers mentioned above. Are there limitations with SGD?
Thanks
This explanation is for Spark version 3.4.1, the latest version at the time of writing this post.
If you look at this doc page, you see that SGD is in fact still supported for linear methods (like the ones you mentioned) and is actually the one that most algorithms implement:
Under the hood, linear methods use convex optimization methods to optimize the objective functions. spark.mllib uses two methods, SGD and L-BFGS, described in the optimization section. Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS. Refer to this optimization section for guidelines on choosing between optimization methods.
You can indeed see in the source code that these are still supported for your examples: `LinearRegressionWithSGD` and `LassoWithSGD`.
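As a quick illustration, this is roughly how those RDD-based SGD variants are used in the Spark 2.x documentation examples (a minimal sketch; it assumes a spark-shell `sc` and the sample data file shipped with Spark, and note that the `train` helpers have been deprecated in favour of the DataFrame-based spark.ml API for some time):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Parse a simple "label,f1 f2 f3 ..." text file into LabeledPoints.
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// SGD is configured through the number of iterations and the step size.
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
```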
Now, even though these are still supported, I did find some hints as to why you might prefer L-BFGS over SGD.
From the docs:
Linear methods use optimization internally, and some linear methods in spark.mllib support both SGD and L-BFGS. Different optimization methods can have different convergence guarantees depending on the properties of the objective function, and we cannot cover the literature here. In general, when L-BFGS is available, we recommend using it instead of SGD since L-BFGS tends to converge faster (in fewer iterations).
This commit also contains some motivation for the change. The relevant part of the commit message:
The PR includes the tests which compare the result with SGD with/without regularization.
We did a comparison between LBFGS and SGD, and often we saw 10x less steps in LBFGS while the cost of per step is the same (just computing the gradient).
The following is the paper by Prof. Ng at Stanford comparing different optimizers including LBFGS and SGD. They use them in the context of deep learning, but worth as reference. http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf
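To make that comparison concrete, here is a minimal sketch (not taken from that commit's tests) that runs the two low-level spark.mllib optimizers on the same least-squares objective and compares the loss histories they return. It assumes a spark-shell `sc`, and the toy data, step size and iteration counts are just illustrative values:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LBFGS, LeastSquaresGradient, SquaredL2Updater}

// Toy regression data as (label, features) pairs; in practice use a real RDD.
val points = sc.parallelize(Seq(
  (1.0, Vectors.dense(1.0, 0.5)),
  (2.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(3.0, 1.5)),
  (4.0, Vectors.dense(4.0, 2.0))
)).cache()

val initialWeights = Vectors.dense(0.0, 0.0)
val regParam = 0.1

// L-BFGS: stops when the convergence tolerance is met or maxNumIterations is reached.
val (lbfgsWeights, lbfgsLoss) = LBFGS.runLBFGS(
  points, new LeastSquaresGradient(), new SquaredL2Updater(),
  10,      // numCorrections
  1e-6,    // convergenceTol
  100,     // maxNumIterations
  regParam, initialWeights)

// Mini-batch SGD: runs for a fixed number of iterations with a hand-tuned step size.
val (sgdWeights, sgdLoss) = GradientDescent.runMiniBatchSGD(
  points, new LeastSquaresGradient(), new SquaredL2Updater(),
  0.01,    // stepSize
  1000,    // numIterations
  regParam,
  1.0,     // miniBatchFraction
  initialWeights)

println(s"L-BFGS iterations: ${lbfgsLoss.length}, SGD iterations: ${sgdLoss.length}")
println(s"L-BFGS final loss: ${lbfgsLoss.last}, SGD final loss: ${sgdLoss.last}")
```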
Of course, this always depends on your exact case, but it might give some more context to what you were originally wondering about.