Tags: python, scikit-learn, svm, libsvm, liblinear

Difference between the penalty and loss parameters in sklearn's LinearSVC


I'm not very familiar with SVM theory, and I'm using the LinearSVC class in Python:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

I was wondering: what is the difference between the penalty and loss parameters?


Solution

In machine learning, a loss function measures the quality of your solution, while a penalty function imposes some constraints on your solution.

Specifically, let X be your data and y the labels of your data. Then the loss function V(f(X), y) measures how well your model f maps your data to the labels. Here, f(X) is the vector of predicted labels.

The L1 and L2 norms are commonly used and intuitively understandable loss functions (see * below).

L1 norm: V(f(X), y) = |f(x1) - y1| + ... + |f(xn) - yn|, where f(xi) is the predicted label of the i-th object and yi is its actual label.

L2 norm: V(f(X), y) = sqrt(|f(x1) - y1|^2 + ... + |f(xn) - yn|^2), where sqrt is the square root.
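
To make the two losses concrete, here is a minimal NumPy sketch; the arrays f_X and y are made-up values standing in for the predicted and the actual labels:

    import numpy as np

    # Hypothetical predictions f(X) and actual labels y
    f_X = np.array([1.2, 0.8, -0.5])
    y = np.array([1.0, 1.0, -1.0])

    l1_loss = np.sum(np.abs(f_X - y))          # |f(x1) - y1| + ... + |f(xn) - yn|
    l2_loss = np.sqrt(np.sum((f_X - y) ** 2))  # sqrt(|f(x1) - y1|^2 + ... + |f(xn) - yn|^2)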

As for the penalty function, it is used to impose some constraints R(f) on your solution f. The L1 norm of the coefficients would be R(f) = |f1| + ... + |fm|, and the L2 norm is defined similarly. Here, f1, ..., fm are the coefficients of the model. You don't know them initially; they are the values learned from your data by the machine learning algorithm.
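
Continuing the sketch above, the penalty is computed on the model's coefficients rather than on the predictions; coef is a made-up coefficient vector:

    # Hypothetical model coefficients f1, ..., fm
    coef = np.array([0.5, 0.0, -2.0])

    l1_penalty = np.sum(np.abs(coef))        # |f1| + ... + |fm|
    l2_penalty = np.sqrt(np.sum(coef ** 2))  # L2 norm of the coefficients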

Eventually, the overall cost function is V(f(X), y) + lambda*R(f), and the goal is to find the f that minimizes this cost. That f is then used to make predictions for new, unseen objects. Why do we need a penalty function? It turns out that a penalty function can add some nice properties to your solution. For example, when you have too many features, the L1 norm helps to prevent overfitting by generating sparse solutions.
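
Putting the two sketches together, the overall cost would be computed like this; lam is an arbitrary regularization strength chosen for illustration:

    lam = 0.1                          # regularization strength lambda (made-up value)
    cost = l1_loss + lam * l1_penalty  # V(f(X), y) + lambda * R(f)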

* This is not exactly how support vector machines work, but it might give you some idea of what these terms mean. In SVM, for example, the L1-hinge and L2-hinge loss functions are used. L1-hinge: V(f(X), y) = max(0, 1 - y1*f(x1)) + ... + max(0, 1 - yn*f(xn)); L2-hinge is similar but with squared terms. You can find a good introduction to machine learning in the Machine Learning course by Andrew Ng on Coursera.
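
In LinearSVC terms, the loss parameter selects V(f(X), y): 'hinge' corresponds to the L1-hinge above, and 'squared_hinge' to the L2-hinge. The penalty parameter selects the norm used in R(f), either 'l1' or 'l2', and C acts as an inverse regularization strength (roughly 1/lambda above). A minimal usage sketch, assuming a feature matrix X_train and labels y_train are already loaded:

    from sklearn.svm import LinearSVC

    # penalty: the norm of the coefficients used as R(f), 'l1' or 'l2'
    # loss: the data-fit term V(f(X), y), 'hinge' or 'squared_hinge'
    # C: inverse regularization strength (roughly 1/lambda above)
    clf = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
    clf.fit(X_train, y_train)

Note that not every combination is supported; for instance, penalty='l1' together with loss='hinge' raises an error.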