pandas  scikit-learn  keyerror  train-test-split

LinearRegression model gives me KeyError: 0


I have written some code for a linear regression model to predict house prices. I'm writing exactly the same code as in a tutorial video. With random_state=42 it works without any error, but when I change random_state to any other number it gives this error.

Here is the code:

from sklearn.model_selection import train_test_split  
X = data.drop('SalesPrice', axis = 1)
y = data['SalesPrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)

predictions = lr.predict(X_test)

print("Actual value of the house: ", y_test[0])
print("Model prediction value: ", predictions[0])

and this is the error:

KeyError                                  Traceback (most recent call last)
File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File C:\ProgramData\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File C:\ProgramData\anaconda3\Lib\site-packages\pandas\_libs\index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[66], line 3
      1 predictions = lr.predict(X_test)
----> 3 print("Actual value of the house: ", y_test[0])
      4 print("Model prediction value: ", predictions[0])

File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\series.py:1007, in Series.__getitem__(self, key)
   1004     return self._values[key]
   1006 elif key_is_scalar:
-> 1007     return self._get_value(key)
   1009 if is_hashable(key):
   1010     # Otherwise index.get_value will raise InvalidIndexError
   1011     try:
   1012         # For labels that don't resolve as scalars like tuples and frozensets

File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\series.py:1116, in Series._get_value(self, label, takeable)
   1113     return self._values[label]
   1115 # Similar to Index.get_value, but we do not fall back to positional
-> 1116 loc = self.index.get_loc(label)
   1118 if is_integer(loc):
   1119     return self._values[loc]

File C:\ProgramData\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 0

Solution

  • As the traceback mentions, the error originates from print("Actual value of the house: ", y_test[0]).

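    y_test is a pandas Series that keeps the original row labels from data, so y_test[0] looks up the row labelled 0, not the first row of the test set. It only works when the random 20% test sample happens to contain the row labelled 0 after train_test_split. That's why it works for some values of random_state and fails for most. predictions[0] is not affected, because lr.predict returns a NumPy array, which is indexed by position.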

    To get the first test value by position, use either of:

    y_test.to_list()[0]
    y_test.iloc[0]
    

    TL;DR: Replace y_test[0] in your print statement with y_test.iloc[0].
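
    To see the label-versus-position difference for yourself, here is a minimal sketch. The small DataFrame and its Area column are made up for illustration and stand in for your real data; only SalesPrice, train_test_split, and the two lookups come from the question. It tries a few random_state values and shows that y_test[0] only succeeds when label 0 lands in the test split, while y_test.iloc[0] always returns the first test row.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's data: 10 rows, default index 0..9.
data = pd.DataFrame({
    "Area": [500 + 100 * i for i in range(10)],
    "SalesPrice": [100_000 + 10_000 * i for i in range(10)],
})
X = data.drop("SalesPrice", axis=1)
y = data["SalesPrice"]

results = []
for seed in range(5):
    _, _, _, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

    try:
        by_label = y_test[0]      # label lookup: needs a row labelled 0
    except KeyError:
        by_label = None           # label 0 went to the training split

    by_position = y_test.iloc[0]  # positional lookup: first row of the test set

    results.append((seed, list(y_test.index), by_label, by_position))
    print(seed, list(y_test.index), by_label, by_position)
```

    In the printed rows, the second column is the set of original labels that ended up in the test split. Whenever 0 is missing from it, the label lookup fails, which is the KeyError: 0 from the question.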