python pandas scikit-learn

Pandas + sklearn Linear regression fails


I am trying to fit a linear regression model in Python using pandas and scikit-learn. This is the code I used:

import pandas
salesPandas = pandas.DataFrame.from_csv('home_data.csv')

# check the shape of the DataFrame (rows, columns)
salesPandas.shape
(21613, 20)

from sklearn.cross_validation import train_test_split

train_dataPandas, test_dataPandas = train_test_split(salesPandas, train_size=0.8, random_state=1)

from sklearn.linear_model import LinearRegression

reg_model_Pandas = LinearRegression()

print type(train_dataPandas)
print train_dataPandas.shape
<class 'pandas.core.frame.DataFrame'>
(17290, 20)

print type(train_dataPandas['price'])
print train_dataPandas['price'].shape
<class 'pandas.core.series.Series'>
(17290L,)

X = train_dataPandas
y = train_dataPandas['price']
reg_model_Pandas.fit(X, y)

When I run the code above, the following error appears:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-dc363e199032> in <module>()
      3 X = train_dataPandas
      4 y = train_dataPandas['price']
----> 5 reg_model_Pandas.fit(X, y)

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, n_jobs)
    374             n_jobs_ = self.n_jobs
    375         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 376                          y_numeric=True, multi_output=True)
    377 
    378         X, y, X_mean, y_mean, X_std = self._center_data(

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
    442     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    443                     ensure_2d, allow_nd, ensure_min_samples,
--> 444                     ensure_min_features)
    445     if multi_output:
    446         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ValueError: invalid literal for float(): 20140610T000000

Output of train_dataPandas.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17290 entries, 4058200630 to 1762600320
Data columns (total 20 columns):
date             17290 non-null object
price            17290 non-null int64
bedrooms         17290 non-null int64
bathrooms        17290 non-null float64
sqft_living      17290 non-null int64
sqft_lot         17290 non-null int64
floors           17290 non-null float64
waterfront       17290 non-null int64
view             17290 non-null int64
condition        17290 non-null int64
grade            17290 non-null int64
sqft_above       17290 non-null int64
sqft_basement    17290 non-null int64
yr_built         17290 non-null int64
yr_renovated     17290 non-null int64
zipcode          17290 non-null int64
lat              17290 non-null float64
long             17290 non-null float64
sqft_living15    17290 non-null int64
sqft_lot15       17290 non-null int64
dtypes: float64(4), int64(15), object(1)
memory usage: 2.8+ MB

Solution

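  • The traceback shows that the date column holds strings such as 20140610T000000 (dtype object in the info() output), and these cannot be converted to float. One possible solution, depending on your data, is to pass parse_dates when reading the file: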

    import pandas
    # parse the 'date' column into datetime values while reading the file
    salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])
    

    This helps because, once the column is parsed, you can break it into numeric components such as month, day, and hour and pass those to the model. It assumes most of the useful variation lies in those components rather than in the year (i.e. your data spans only about 3-4 distinct years).

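    From here you can use the datetimelike properties of the .dt accessor: salesPandas['date'].dt.month gives the month, and .dt.day and .dt.hour give the day and hour. The parsed date column itself is still not numeric, so it is safest to leave it out of X once you have extracted the features you need.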
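
A minimal sketch of the approach above, assuming a small made-up frame in place of home_data.csv (the values are invented for illustration, and the column names sale_month and sale_day are hypothetical). It parses the timestamps with an explicit format rather than relying on parse_dates. It also drops price from X, since the question's code passes the target column in as a feature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical rows shaped like home_data.csv (values invented for illustration).
sales = pd.DataFrame({
    "date": ["20140610T000000", "20141209T000000", "20150225T000000",
             "20140718T000000", "20150412T000000", "20141013T000000"],
    "price": [221900, 538000, 180000, 604000, 510000, 1225000],
    "bedrooms": [3, 3, 2, 4, 3, 4],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420],
})

# Parse the string timestamps; the format matches values like '20140610T000000'.
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")

# Derive numeric features from the datetime column via the .dt accessor.
sales["sale_month"] = sales["date"].dt.month
sales["sale_day"] = sales["date"].dt.day

# Drop the raw datetime column and the target so that X is purely numeric.
X = sales.drop(columns=["date", "price"])
y = sales["price"]

model = LinearRegression().fit(X, y)
print(X.dtypes)
print(model.coef_.shape)
```

Whether month and day are actually informative for price is a modelling question; the sketch only shows how to get the date into a form that fit() accepts.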