I'm currently evaluating which classifier has the best performance for a movie-review sentiment analysis task. So far I have evaluated Logistic Regression, Linear Regression, Random Forest, and Decision Tree, but I also want to consider KNN, GNB, and SVC. The problem is that each run of these algorithms (particularly KNN) takes a very long time. Even using RandomizedSearchCV for KNN, I have to wait about an hour for 10 iterations. Here are some snippets:
KNN Classifier
#KNearestNeighbors X -> large execution time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier()
k_range = list(range(1, 50))
options = ['uniform', 'distance']
param_grid = dict(n_neighbors=k_range, weights=options)
rand_knn = RandomizedSearchCV(knn, param_grid, cv=10, scoring='accuracy', n_iter=10, random_state=0)
rand_knn.fit(x_train_bow, y_train)
print(rand_knn.best_score_)
print(rand_knn.best_params_)

# predict with the best estimator found by the search
# (y_pred_knn was never assigned in the original snippet)
y_pred_knn = rand_knn.predict(x_test_bow)
confm_knn = confusion_matrix(y_test, y_pred_knn)
print_confm(confm_knn)
print("=============K NEAREST NEIGHBORS============")
print_metrics(y_test, y_pred_knn)
print("============================================")
I waited about 85 minutes for the RandomizedSearchCV run, but it never finished and I had to kill the execution. To get any result at all, I tried choosing the best k manually with a for loop, but each iteration still takes 12-17 minutes.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def testing_k_neighbors(x_train_bow, y_train, x_test_bow, y_test):
    accuracy_hist = []
    for i in range(1, 21):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(x_train_bow, y_train)
        yi_pred_knn = knn.predict(x_test_bow)
        acc_i = accuracy_score(y_test, yi_pred_knn)
        accuracy_hist.append(acc_i)
        print(f"K: {i}, accuracy: {acc_i}")
    print(accuracy_hist)
output:
K: 1, accuracy: 0.7384384634613782
K: 2, accuracy: 0.7435213732188984
K: 3, accuracy: 0.7574368802599784
K: 4, accuracy: 0.7678526789434214
K: 5, accuracy: 0.7681859845012916
K: 6, accuracy: 0.7745187901008249
K: 7, accuracy: 0.7729355887009416
K: 8, accuracy: 0.7774352137321889
K: 9, accuracy: 0.7742688109324223
K: 10, accuracy: 0.7810182484792934
K: 11, accuracy: 0.7776851929005916
K: 12, accuracy: 0.7854345471210732
K: 13, accuracy: 0.783101408215982
K: 14, accuracy: 0.7866844429630864
K: 15, accuracy: 0.784934588784268
K: 16, accuracy: 0.78860094992084
K: 17, accuracy: 0.7873510540788268
K: 18, accuracy: 0.7893508874260479
K: 19, accuracy: 0.7856011999000083
K: 20, accuracy: 0.7916006999416715
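One thing this output does show is that accuracy is still climbing at k = 20, so the scan probably needs to extend further. To cover a wider range of k without paying the full cost each time, here is a sketch of a subsampling shortcut, under the assumption that the ranking of k values is roughly stable when training on a stratified 20% subset:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train each candidate k on a stratified 20% subset of the training data;
# absolute accuracies will be lower, but the best k should be similar
x_sub, _, y_sub, _ = train_test_split(x_train_bow, y_train, train_size=0.2,
                                      stratify=y_train, random_state=0)
for i in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=i, n_jobs=-1)
    knn.fit(x_sub, y_sub)
    print(f"K: {i}, accuracy: {accuracy_score(y_test, knn.predict(x_test_bow))}")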
SVC and GNB also take a similarly long time to produce any result:
#Support Vector Machine X -> large execution time
#svc=SVC(C = 100, kernel = 'linear', random_state=123)
#svc.fit(x_train_bow,y_train)
#y_pred_svc = svc.predict(x_test_bow)
#print("=============SUPPORT VECTOR MACHINE============")
#print_metrics(y_test,y_pred_svc)
#print("============================================")
#Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

gnbc = GaussianNB()
# GaussianNB cannot handle sparse input, so the matrix must be densified
# for both fit and predict (very memory-hungry at this dimensionality);
# the original snippet passed a sparse matrix to predict, which fails
gnbc.fit(x_train_bow.toarray(), y_train)
y_pred_gnbc = gnbc.predict(x_test_bow.toarray())
print("=============GAUSSIAN NAIVE BAYES============")
print_metrics(y_test, y_pred_gnbc)
print("============================================")
Is there any way to tune my code to reduce execution time while maintaining or improving model performance? I'm looking to optimize for both efficiency and accuracy.
I tried your code; printing x_train_bow gives:
<28000x122447 sparse matrix of type '<class 'numpy.float64'>'
with 2796291 stored elements in Compressed Sparse Row format>
You have 122,447 columns because of how you used TfidfVectorizer. This is a dimensionality problem, and it is why everything takes so long; no amount of tuning will fix it on the model side (KNN, SVC, trees) at this scale. You need to reduce the dimensionality: select the relevant words first, then apply TfidfVectorizer.
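To make that concrete, here is a sketch of two standard ways to shrink the feature space (the thresholds are illustrative, not tuned, and train_texts/test_texts are hypothetical names standing in for your raw review lists):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Option 1: cap the vocabulary at vectorization time; min_df drops rare
# words and max_features keeps only the most frequent ones
vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_features=20000)
x_train_bow = vectorizer.fit_transform(train_texts)
x_test_bow = vectorizer.transform(test_texts)

# Option 2: keep the full vocabulary, then keep only the k features most
# associated with the labels (chi2 works on non-negative tf-idf values)
selector = SelectKBest(chi2, k=10000)
x_train_bow = selector.fit_transform(x_train_bow, y_train)
x_test_bow = selector.transform(x_test_bow)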