I am performing a simple classification using scikit-learn's LinearSVC (liblinear).
I cannot reproduce the predicted values by hand and get the same accuracy that "LinearSVC.predict" gives.
What am I doing wrong? The following code is self-contained and demonstrates my problem.
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC  # liblinear

N = 6000
m = 500
D = sp.random(N, m, random_state=1)  # sparse matrix with uniform [0, 1) nonzeros
D.data *= 2                          # rescale the nonzeros to [-1, 1)
D.data -= 1
X = sp.csr_matrix(D)
y = (X.sum(axis=1) > 0) * 2 - 1.0    # labels in {-1, +1}
x_train = X[:5000, :]
y_train = y[:5000, :]
x_test = X[5000:, :]
y_test = y[5000:, :]
clf = LinearSVC(C=.1, fit_intercept=False, loss='hinge')
clf.fit(x_train, np.asarray(y_train).ravel())  # fit expects a 1-d target array
print("Direct prediction accuracy:\t", 100 - 100 * np.mean((np.sign(x_test * clf.coef_.T) != y_test) + 0.0), "%")
print("CLF prediction accuracy:\t", 100 * clf.score(x_test, y_test), "%")
Output:
Direct prediction accuracy: 90.8 %
CLF prediction accuracy: 91.3 %
Thanks for any help!
The difference comes from how np.sign treats zeros.
np.sign(0) is 0, so any test row whose decision score is exactly zero (for example, a row with no nonzero features at all, which with fit_intercept=False scores exactly 0) is assigned neither of the two valid classes (+1 or -1) and is counted as a misprediction. Classifier.predict, on the other hand, strictly outputs one of the two classes. A tiny twist of your prediction method from np.sign(x_test * clf.coef_.T)
to np.where(x_test * clf.coef_.T > 0, 1, -1)
will give exactly the same accuracy as the built-in predict method:
print "Direct prediction accuracy:\t", 100-100*np.mean((np.where(x_test * clf.coef_.T > 0, 1, -1) != y_test)+0.0) ,"%"
print "CLF prediction accuracy:\t", 100*clf.score(x_test, y_test),"%"
# Direct prediction accuracy: 92.7 %
# CLF prediction accuracy: 92.7 %
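
For completeness, here is a minimal sketch (reusing clf and x_test from the code above) showing that for a binary LinearSVC, predict is exactly a zero-threshold on decision_function, and that the rows np.sign left unclassified are exactly the zero-score rows. decision_function is the standard scikit-learn API; the rest is illustrative:

# Per-sample decision scores; with fit_intercept=False this equals x_test * clf.coef_.T.
scores = clf.decision_function(x_test)

# For binary problems predict() returns classes_[1] (here +1) where the score is
# positive and classes_[0] (here -1) otherwise, so thresholding at zero matches it.
manual = np.where(scores > 0, 1, -1)
print("Matches clf.predict:", np.array_equal(manual, clf.predict(x_test)))

# These are the samples np.sign mapped to 0: their score is exactly zero,
# e.g. test rows with no nonzero features (recall the intercept is disabled).
print("Rows with a zero score:", np.sum(scores == 0))

If the intercept were enabled, an all-zero row would score the intercept value instead of 0, so exact zeros would be much rarer, but the thresholding fix would still be the right way to mirror predict.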