I'm trying to calculate the R-squared value after fitting a model with sklearn's LinearRegression.
The dataset is taken from https://www.kaggle.com/datasets/jeremylarcher/american-house-prices-and-demographics-of-top-cities
The code is as follows:
'''Let's verify whether there is a correlation between price and the number of beds and bathrooms.'''
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('data/American_Housing_Data_20231209.csv')
df_interesting_columns = df[['Beds', 'Baths', 'Price']]
independent_variables = df_interesting_columns[['Beds', 'Baths']]
dependent_variable = df_interesting_columns[['Price']]
X_train, X_test, y_train, y_test = train_test_split(independent_variables, dependent_variable, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(model.score(y_test, prediction))
But I get the error:
ValueError: The feature names should match those that were passed during fit. Feature names unseen at fit time:
What am I doing wrong?
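Your last line is wrong. You misunderstood the score method: score takes X and y as parameters, not y_true and y_pred. It calls predict on X internally and compares the result with y. Because you passed y_test (a DataFrame whose only column is Price) as X, the model saw a feature name it was not fitted on, hence the error.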
Try:
from sklearn.metrics import r2_score
print(r2_score(y_test, prediction))
# 0.24499127100887863
Or with the score method:
print(model.score(X_test, y_test))
# 0.24499127100887863
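To check that the two approaches really agree, here is a minimal sketch that runs without the Kaggle file. It assumes a synthetic stand-in dataset: the Beds and Baths columns, the coefficients, and the noise level are all made up for illustration and are not taken from the real data, so the score it prints will differ from the one above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (all values are invented).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Beds": rng.integers(1, 6, size=n),
    "Baths": rng.integers(1, 4, size=n),
})
y = 50_000 * X["Beds"] + 30_000 * X["Baths"] + rng.normal(0, 40_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# score(X, y) predicts from X internally, then computes R^2 against y.
score_from_method = model.score(X_test, y_test)

# r2_score(y_true, y_pred) takes the predictions you computed yourself.
score_from_metric = r2_score(y_test, model.predict(X_test))

print(score_from_method, score_from_metric)

# Reproducing the mistake from the question: passing the target as X.
try:
    model.score(y_test.to_frame("Price"), model.predict(X_test))
    raised = False
except ValueError:
    raised = True
print("ValueError raised:", raised)
```

Both printed scores should be identical, since score is just a shortcut for predicting and then calling r2_score. The last block passes a one-column Price frame where the features belong, which is rejected with a ValueError.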