python-3.x pandas machine-learning scikit-learn linear-regression

The feature names should match those that were passed during fit


I'm trying to calculate the R-squared value of a model built with scikit-learn's LinearRegression.

I'm simply:

  1. importing a csv dataset
  2. filtering the interesting columns
  3. splitting the dataset in train and test
  4. creating the model
  5. making a prediction on the test
  6. calculating the R-squared to see how well the model fits the test dataset

The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities

The code is as follows:

'''Let's verify whether price correlates with the number of beds and bathrooms.'''

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv('data/American_Housing_Data_20231209.csv')

df_interesting_columns = df[['Beds', 'Baths', 'Price']]

independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]

X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

prediction = model.predict(X_test)

print(model.score(y_test, prediction))

But I get the error:

ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time:

What am I doing wrong?


Solution

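  • Your last line is wrong: the score method takes the features X and the true targets y as parameters, not y_true and y_pred. It calls predict(X) internally and compares the result with y. In model.score(y_test, prediction), y_test is treated as the feature matrix, so its 'Price' column does not match the 'Beds' and 'Baths' columns seen during fit, which raises the error.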

    Try:

    from sklearn.metrics import r2_score
    
    print(r2_score(y_test, prediction))
    # 0.24499127100887863
    

    Or with the score method:

    print(model.score(X_test, y_test))
    # 0.24499127100887863
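
To check that the two approaches agree without downloading the Kaggle file, here is a minimal, self-contained sketch. It uses a small synthetic DataFrame as a stand-in for the housing data (the numbers and the random_state values are made up for illustration, not taken from the question), so the printed scores will differ from the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are invented for illustration).
rng = np.random.default_rng(0)
beds = rng.integers(1, 6, size=200)
baths = rng.integers(1, 4, size=200)
price = 50_000 * beds + 30_000 * baths + rng.normal(0, 20_000, size=200)
df = pd.DataFrame({"Beds": beds, "Baths": baths, "Price": price})

X = df[["Beds", "Baths"]]
y = df[["Price"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
prediction = model.predict(X_test)

# Both calls compute R^2 of the predictions against the true targets.
r2_from_metric = r2_score(y_test, prediction)
r2_from_score = model.score(X_test, y_test)
print(r2_from_metric, r2_from_score)

# Passing (y_test, prediction) makes the model treat 'Price' as the features,
# which does not match what it saw during fit.
try:
    model.score(y_test, prediction)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The two printed scores should be identical, since score on a regressor is R-squared computed from predict(X_test). The last call should fail with a ValueError, although the exact message may vary between scikit-learn versions.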