python scikit-learn logistic-regression encoder

Accuracy score for sklearn not returning a value


I have a dataset and have one-hot encoded the target column (5 distinct strings across the entire column) using pd.get_dummies. I then used sklearn's train_test_split function to create the training, test and validation sets. The training features were then normalized with StandardScaler(). I have fit a logistic regression model on the training features and target.

I am now trying to calculate the accuracy score for the training, validation and test sets but am having no luck. My code up to this part is below:

dataset = pd.read_csv('tabular_data/clean_tabular_data.csv')
features, label = load_airbnb(dataset, 'Category')
label_series = dataset['Category']

label_encoded = pd.get_dummies(label_series)

X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)


# normalize the features 
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)

# get baseline classification model
model = LogisticRegression()
y_train = y_train.iloc[:, 0]
model.fit(X_train_scaled, y_train)

y_train_pred = model.predict(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=0) 
y_validation_pred = model.predict(X_validation_scaled)
y_validation_pred = np.argmax(y_validation_pred, axis =0)
y_test_pred = model.predict(X_test_scaled)
y_test_pred = np.argmax(y_test_pred, axis = 0)

# evaluate model using accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
validation_acc = accuracy_score(y_validation, y_validation_pred)

The error I am getting is here: "File "C:\Users\lcox1\Documents\VSCode\AiCore\Data science\classification_prac.py", line 56, in train_acc = accuracy_score(y_train, y_train_pred)

TypeError: Singleton array 16 cannot be considered a valid collection."

I am fairly new to Python, so I have no idea what the issue is. Any help is appreciated.


Solution

  • You are getting that error because of these lines:

    y_train_pred = model.predict(X_train_scaled)
    y_train_pred = np.argmax(y_train_pred, axis=0)
    

    When you call model.predict(), it returns an array of predicted labels, not probabilities. Taking np.argmax of that 1-D array with axis=0 gives a single value, the index of the maximum, so accuracy_score receives a scalar instead of an array of predictions and raises the TypeError.
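
To see the failure in isolation, here is a minimal sketch. The two small arrays are made-up stand-ins for y_train and the output of model.predict(), not your data, and the exact wording of the error message may differ between scikit-learn versions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins for y_train and model.predict(X_train_scaled):
# predict() returns a 1-D array with one label per sample.
y_true = np.array([0, 1, 1, 0, 1])
y_pred_labels = np.array([0, 1, 0, 0, 1])

# argmax over axis 0 of a 1-D array collapses it to a single index.
collapsed = np.argmax(y_pred_labels, axis=0)
print(collapsed.shape)  # ()

# accuracy_score cannot compare an array of labels against a scalar.
try:
    accuracy_score(y_true, collapsed)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Passing the labels from predict() unchanged works.
acc = accuracy_score(y_true, y_pred_labels)
print(acc)  # 0.8
```

The scalar in the error message ("Singleton array 16") is simply the index that argmax returned.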

    Most likely you mean to do:

    y_train_pred = model.predict_proba(X_train_scaled)
    y_train_pred = np.argmax(y_train_pred, axis=1) 
    y_train_pred
    

    As @BenReiniger pointed out in the comments, if you are trying to train a model on multiclass labels, you should not one-hot encode the target. Try something like the example below, where I use an example dataset and keep the labels as a single categorical column:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import LabelEncoder
    
    data = load_iris()
    features = data.data
    label_series = pd.Series(data.target).map({0:"setosa",1:"versicolor",2:"virginica"})
    label_series = pd.Categorical(label_series)
    
    le = LabelEncoder()
    label_encoded = le.fit_transform(label_series)
    

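    Running your code with some changes (taking the argmax of predict_proba works here because LabelEncoder produces integer labels 0 to n-1, which match the column order of the probabilities):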

    X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
    X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)
     
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_validation_scaled = scaler.transform(X_validation)
    X_test_scaled = scaler.transform(X_test)
    
    # get baseline classification model
    model = LogisticRegression()
    model.fit(X_train_scaled, y_train)
    
    y_train_pred = model.predict_proba(X_train_scaled)
    y_train_pred = np.argmax(y_train_pred, axis=1) 
    y_validation_pred = model.predict_proba(X_validation_scaled)
    y_validation_pred = np.argmax(y_validation_pred, axis =1)
    y_test_pred = model.predict_proba(X_test_scaled)
    y_test_pred = np.argmax(y_test_pred, axis = 1)
    
    # evaluate model using accuracy
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    validation_acc = accuracy_score(y_validation, y_validation_pred)
    

    The results (these vary from run to run, since the split is random):

    print(train_acc,test_acc,validation_acc)
    0.9809523809523809 0.9090909090909091 1.0
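
As a side note, the argmax step is not strictly needed, because model.predict() already returns one label per sample. A minimal sketch of that variant follows, assuming the same iris example; using the string class names directly as labels and fixing random_state=0 are my own choices for illustration, not something from your code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_iris()
# Use the string class names as labels; no one-hot or integer encoding.
labels = data.target_names[data.target]

# random_state=0 is an arbitrary choice, added only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, labels, test_size=0.3, random_state=0
)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression().fit(X_train_scaled, y_train)

# predict() already returns one class label per sample.
y_test_pred = model.predict(X_test_scaled)

# Column j of predict_proba corresponds to model.classes_[j], so the
# argmax has to be mapped back through classes_ to recover the labels.
proba = model.predict_proba(X_test_scaled)
y_from_proba = model.classes_[np.argmax(proba, axis=1)]

same = bool(np.array_equal(y_test_pred, y_from_proba))
test_acc = accuracy_score(y_test, y_test_pred)
print(same, test_acc)
```

Mapping through model.classes_ is what keeps the predict_proba route correct when the labels are strings or integers that do not start at 0.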