python, scikit-learn, svm, liblinear

Meaning of `penalty` and `loss` in LinearSVC


Anti-closing preamble: I have read the question "difference between penalty and loss parameters in Sklearn LinearSVC library", but I don't find the answer there specific enough. Therefore, I'm reformulating the question:

I am familiar with SVM theory and I'm experimenting with the LinearSVC class in Python. However, the documentation is not quite clear regarding the meaning of the penalty and loss parameters. I reckon that loss refers to the penalty for points violating the margin (usually denoted by the Greek letter xi or zeta in the objective function), while penalty is the norm of the vector determining the class boundary, usually denoted by w. Can anyone confirm or deny this?

If my guess is right, then penalty = 'l1' would lead to minimisation of the L1-norm of the vector w, like in LASSO regression. How does this relate to the maximum-margin idea of the SVM? Can anyone point me to a publication regarding this question? In the original paper describing LIBLINEAR I could not find any reference to L1 penalty.

Also, if my guess is right, why doesn't LinearSVC support the combination of penalty='l2' and loss='hinge' (the standard combination in SVC) when dual=False? When I try it, I get the following error:

ValueError: Unsupported set of arguments
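
For reference, a minimal snippet that triggers it (on toy data, just as an illustration):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data only, to reproduce the error
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# penalty='l2' with loss='hinge' is only accepted together with dual=True;
# with dual=False, fit() raises "ValueError: Unsupported set of arguments ..."
clf = LinearSVC(penalty='l2', loss='hinge', dual=False)
clf.fit(X, y)
```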


Solution

  • Though very late, I'll try to give my answer. According to the doc, here's the considered primal optimization problem for LinearSVC:

        min over w, b of   (1/2) w^T w  +  C * sum_i max(0, 1 - y_i (w^T phi(x_i) + b))

    with phi being the identity function, given that LinearSVC only solves linear problems.
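
    As a rough numerical illustration (the toy dataset and variable names below are mine, not from the docs), the objective above can be evaluated by hand from a fitted model's coef_ and intercept_; this just mirrors the formula, not any internal LIBLINEAR quantity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
y_signed = np.where(y == 1, 1, -1)          # labels as +1 / -1

C = 1.0
# penalty='l2' + loss='hinge' corresponds to the primal problem above
clf = LinearSVC(penalty='l2', loss='hinge', C=C, dual=True,
                max_iter=10000).fit(X, y)

w = clf.coef_.ravel()
b = clf.intercept_[0]

margins = y_signed * (X @ w + b)
objective = 0.5 * (w @ w) + C * np.maximum(0.0, 1.0 - margins).sum()
print(objective)
```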

    Effectively, this is just one of the possible problems that LinearSVC admits (it is the L2-regularized, L1-loss problem in the terms of the LIBLINEAR paper) and not the default one (which is the L2-regularized, L2-loss problem). Chapter 2 of the LIBLINEAR paper gives a more general formulation covering what is referred to as the loss, and the Appendix (A2 and A4) then elaborates on what is referred to as the penalty.
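
    As a quick sanity check of those defaults (attribute names as exposed by current scikit-learn, so treat the exact values as version-dependent):

```python
from sklearn.svm import LinearSVC

clf = LinearSVC()
# Default parameters: L2 penalty with squared hinge loss,
# i.e. the L2-regularized, L2-loss problem
print(clf.penalty, clf.loss)   # l2 squared_hinge
```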

    Basically, it states that LIBLINEAR is meant to solve the following unconstrained optimization problem with different loss functions xi(w; x, y) (which are the hinge and the squared hinge):

        min over w of   (1/2) w^T w  +  C * sum_i xi(w; x_i, y_i)

    The default setting of the model in LIBLINEAR does not consider the bias term b, which is why you won't see any reference to b from now on (there are many posts on SO about this).
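
    A small sketch of the two loss functions mentioned above (plain NumPy, function names are mine):

```python
import numpy as np

def hinge(margins):
    """L1 loss: max(0, 1 - y * w.x)."""
    return np.maximum(0.0, 1.0 - margins)

def squared_hinge(margins):
    """L2 loss: max(0, 1 - y * w.x) ** 2."""
    return np.maximum(0.0, 1.0 - margins) ** 2

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])   # values of y_i * w.x_i
print(hinge(margins))          # [2.   1.   0.5  0.   0. ]
print(squared_hinge(margins))  # [4.   1.   0.25 0.   0. ]
```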

    As for the penalty, it basically represents the norm of the vector w that is used. The appendix elaborates on the different problems:

        - L2-regularized, L1-loss:  min over w of  (1/2) w^T w  +  C * sum_i max(0, 1 - y_i w^T x_i)
        - L2-regularized, L2-loss:  min over w of  (1/2) w^T w  +  C * sum_i max(0, 1 - y_i w^T x_i)^2
        - L1-regularized, L2-loss:  min over w of  ||w||_1      +  C * sum_i max(0, 1 - y_i w^T x_i)^2
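
    This also touches on the LASSO-like behaviour the question asks about: with penalty='l1' many coefficients of w are typically driven to exactly zero. A rough sketch on toy data (my own variable names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy problem with many uninformative features
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# L1-regularized, L2-loss (squared hinge) -- the only loss allowed with penalty='l1'
l1 = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1,
               max_iter=10000).fit(X, y)
# L2-regularized, L2-loss for comparison
l2 = LinearSVC(penalty='l2', loss='squared_hinge', dual=False, C=0.1,
               max_iter=10000).fit(X, y)

print("zero coefficients with l1:", int(np.sum(l1.coef_ == 0)))
print("zero coefficients with l2:", int(np.sum(l2.coef_ == 0)))
```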

    On the other hand, as stated in the documentation, LinearSVC does not support the combination of penalty='l1' and loss='hinge'. As far as I can see, the paper does not specify why, but I found a possible answer here (in the answer by Arun Iyer).

    Finally, the combination of penalty='l2', loss='hinge', dual=False is not supported, as specified here (it is simply not implemented in LIBLINEAR) or here; I'm not sure whether that's the case, but from Appendix B onwards the LIBLINEAR paper specifies the optimization problem that is actually solved (which, for the L2-regularized, L1-loss case, seems to be the dual).
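
    To see which combinations are accepted in practice, one can simply try them and catch the error; this is just an exploratory sketch, not an official compatibility table:

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

for penalty, loss, dual in product(['l1', 'l2'],
                                   ['hinge', 'squared_hinge'],
                                   [True, False]):
    try:
        LinearSVC(penalty=penalty, loss=loss, dual=dual).fit(X, y)
        status = "supported"
    except ValueError:
        status = "unsupported"
    print(f"penalty={penalty!r:>5} loss={loss!r:>16} dual={dual!s:>5} -> {status}")
```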

    For a theoretical discussion of SVC problems in general, I found that chapter really useful; it shows how minimizing the norm of w relates to the idea of the maximum margin.