python, machine-learning, nlp, sentiment-analysis, sklearn-pandas

How can I optimize KNN, GNB and SVC sklearn algorithms to reduce execution time?


I'm currently evaluating which classifier has the best performance for a movie-review sentiment analysis task. So far I have evaluated Logistic Regression, Linear Regression, Random Forest and Decision Tree, but I also want to consider KNN, GNB and SVC models. The problem is that each execution of those algorithms (particularly KNN) takes a long time. Even using RandomizedSearchCV for KNN, I have to wait about 1 hour for 10 iterations. Here are some snippets:

KNN Classifier

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.metrics import confusion_matrix

    # K-Nearest Neighbors  X -> large execution time
    knn = KNeighborsClassifier()
    k_range = list(range(1, 50))
    options = ['uniform', 'distance']
    param_grid = dict(n_neighbors=k_range, weights=options)
    rand_knn = RandomizedSearchCV(knn, param_grid, cv=10, scoring='accuracy',
                                  n_iter=10, random_state=0)
    rand_knn.fit(x_train_bow, y_train)
    print(rand_knn.best_score_)
    print(rand_knn.best_params_)
    # Predict with the best estimator found by the search
    y_pred_knn = rand_knn.predict(x_test_bow)
    confm_knn = confusion_matrix(y_test, y_pred_knn)
    print_confm(confm_knn)
    print("=============K NEAREST NEIGHBORS============")
    print_metrics(y_test, y_pred_knn)
    print("============================================")

I waited about 85 minutes for the code above, but it never finished and I had to kill the execution. To get any result at all, I tried choosing the best k manually with a for loop, but each iteration still takes 12 to 17 minutes.

from sklearn.metrics import accuracy_score

def testing_k_neighbors(x_train_bow, y_train, x_test_bow, y_test):
    """Manually scan k from 1 to 20 and record test accuracy."""
    accuracy_hist = []
    for i in range(1, 21):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(x_train_bow, y_train)
        yi_pred_knn = knn.predict(x_test_bow)
        acc_i = accuracy_score(y_test, yi_pred_knn)
        accuracy_hist.append(acc_i)
        print(f"K: {i}, accuracy: {acc_i}")
    print(accuracy_hist)

output:

K: 1, accuracy: 0.7384384634613782
K: 2, accuracy: 0.7435213732188984
K: 3, accuracy: 0.7574368802599784
K: 4, accuracy: 0.7678526789434214
K: 5, accuracy: 0.7681859845012916
K: 6, accuracy: 0.7745187901008249
K: 7, accuracy: 0.7729355887009416
K: 8, accuracy: 0.7774352137321889
K: 9, accuracy: 0.7742688109324223
K: 10, accuracy: 0.7810182484792934
K: 11, accuracy: 0.7776851929005916
K: 12, accuracy: 0.7854345471210732
K: 13, accuracy: 0.783101408215982
K: 14, accuracy: 0.7866844429630864
K: 15, accuracy: 0.784934588784268
K: 16, accuracy: 0.78860094992084
K: 17, accuracy: 0.7873510540788268
K: 18, accuracy: 0.7893508874260479
K: 19, accuracy: 0.7856011999000083
K: 20, accuracy: 0.7916006999416715

SVC and GNB also take a similarly long time to produce any result:

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    # Support Vector Machine  X -> large execution time
    # svc = SVC(C=100, kernel='linear', random_state=123)
    # svc.fit(x_train_bow, y_train)
    # y_pred_svc = svc.predict(x_test_bow)
    # print("=============SUPPORT VECTOR MACHINE============")
    # print_metrics(y_test, y_pred_svc)
    # print("============================================")

    # Gaussian Naive Bayes
    gnbc = GaussianNB()
    gnbc.fit(x_train_bow.toarray(), y_train)
    # GaussianNB requires dense input, so the test matrix
    # must be converted with .toarray() as well
    y_pred_gnbc = gnbc.predict(x_test_bow.toarray())
    print("=============GAUSSIAN NAIVE BAYES============")
    print_metrics(y_test, y_pred_gnbc)
    print("============================================")

Is there any way to tune my code to reduce execution time while maintaining or improving model performance?

I'm expecting to tune my code, prioritizing both efficiency and performance.


Solution

  • I tried your code and printed "x_train_bow":

    <28000x122447 sparse matrix of type '<class 'numpy.float64'>'
        with 2796291 stored elements in Compressed Sparse Row format>
    

    You have 122,447 columns because you used TfidfVectorizer over the full vocabulary. This is a dimensionality problem, and it is why everything takes so long; no amount of tuning KNN, SVC or tree models will fix it. You need to reduce the dimensionality: extract the relevant words first, then apply TfidfVectorizer.
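
    To make "reduce the dimensionality" concrete, here is a minimal sketch of one way to do it with scikit-learn. The variables `reviews` and `labels` are placeholders for your raw data, and the cut-offs (`max_features`, `min_df`, `k`) are illustrative assumptions, not tuned values:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier

        # Placeholder data: `reviews` is a list of raw review strings,
        # `labels` the matching sentiment labels (assumed names).
        x_train, x_test, y_train, y_test = train_test_split(
            reviews, labels, random_state=0)

        # Cap the vocabulary directly in TfidfVectorizer:
        #   max_features keeps only the most frequent terms,
        #   min_df drops terms appearing in fewer than 5 documents,
        #   stop_words removes common English filler words.
        vectorizer = TfidfVectorizer(max_features=10000, min_df=5,
                                     stop_words='english')
        x_train_bow = vectorizer.fit_transform(x_train)
        x_test_bow = vectorizer.transform(x_test)

        # Optionally keep only the terms most associated with the labels.
        # Fit on the training split only, to avoid leaking test information.
        selector = SelectKBest(chi2, k=2000)
        x_train_sel = selector.fit_transform(x_train_bow, y_train)
        x_test_sel = selector.transform(x_test_bow)

        # KNN over ~2,000 columns instead of ~122,000 is dramatically faster;
        # n_jobs=-1 parallelizes the neighbor search across all cores.
        knn = KNeighborsClassifier(n_neighbors=20, n_jobs=-1)
        knn.fit(x_train_sel, y_train)
        print(knn.score(x_test_sel, y_test))

    Another common option is TruncatedSVD, which projects a sparse tf-idf matrix down to a few hundred dense components; that also suits GaussianNB, since it needs dense input anyway.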