python scikit-learn logistic-regression encoder

Accuracy score for sklearn not returning a value

I have a dataset and have one hot encoded the target column (5 different strings throughout the entire column) using pd.get_dummies. I have then used sklearn's train_test_split function to create the training, testing and validation sets. The training set (features) were then normalized with standardScalar(). I have fit the training sets of both the features and the target to a logistic regression model.

I am now trying to calculate the accuracy score for the training, validation and test sets but am having no luck. My code up to this part is below:

dataset = pd.read_csv('tabular_data/clean_tabular_data.csv')
features, label = load_airbnb(dataset, 'Category')
label_series = dataset['Category']

label_encoded = pd.get_dummies(label_series)

X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)


# normalize the features 
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)

# get baseline classification model
model = LogisticRegression()
y_train = y_train.iloc[:, 0]
model.fit(X_train_scaled, y_train)

y_train_pred = model.predict(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=0) 
y_validation_pred = model.predict(X_validation_scaled)
y_validation_pred = np.argmax(y_validation_pred, axis =0)
y_test_pred = model.predict(X_test_scaled)
y_test_pred = np.argmax(y_test_pred, axis = 0)

# evaluate model using accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
validation_acc = accuracy_score(y_validation, y_validation_pred)

The error I am getting is here: "File "C:\Users\lcox1\Documents\VSCode\AiCore\Data science\classification_prac.py", line 56, in train_acc = accuracy_score(y_train, y_train_pred)

TypeError: Singleton array 16 cannot be considered a valid collection."

I am fairly new to python so have no idea what the issue is. Any help appreciated.

Solution

You are getting that error because of these lines:

y_train_pred = model.predict(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=0)

When you call model.predict(), it actually returns you an array of predicted labels, and not the probabilities. And if you do argmax of this array, you get 1 value, which is the index of the maximum value, hence it throws you the error, during prediction.

Most likely you mean to do:

y_train_pred = model.predict_proba(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=1) 
y_train_pred

As @BenReiniger pointed out in the comments, if you are trying to train a model on multi class labels, you should not one-hot encode. Try something below, where I used an example dataset, and have the labels as a category:

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

data = load_iris()
features = data.data
label_series = pd.Series(data.target).map({0:"setosa",1:"virginica",2:"versicolor"})
label_series = pd.Categorical(label_series)

le = LabelEncoder()
label_encoded = le.fit_transform(label_series)

Running your code with some changes:

X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)
 
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)

# get baseline classification model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

y_train_pred = model.predict_proba(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=1) 
y_validation_pred = model.predict_proba(X_validation_scaled)
y_validation_pred = np.argmax(y_validation_pred, axis =1)
y_test_pred = model.predict_proba(X_test_scaled)
y_test_pred = np.argmax(y_test_pred, axis = 1)

# evaluate model using accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
validation_acc = accuracy_score(y_validation, y_validation_pred

The results:

print(train_acc,test_acc,validation_acc)
0.9809523809523809 0.9090909090909091 1.0