I am trying to fit a linear regression model in Python. This is the code I have so far:
import pandas
salesPandas = pandas.DataFrame.from_csv('home_data.csv')
# check the shape of the DataFrame (rows, columns)
salesPandas.shape
(21613, 20)
from sklearn.cross_validation import train_test_split
train_dataPandas, test_dataPandas = train_test_split(salesPandas, train_size=0.8, random_state=1)
from sklearn.linear_model import LinearRegression
reg_model_Pandas = LinearRegression()
print type(train_dataPandas)
print train_dataPandas.shape
<class 'pandas.core.frame.DataFrame'>
(17290, 20)
print type(train_dataPandas['price'])
print train_dataPandas['price'].shape
<class 'pandas.core.series.Series'>
(17290L,)
X = train_dataPandas
y = train_dataPandas['price']
reg_model_Pandas.fit(X, y)
When I run the code above, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-dc363e199032> in <module>()
3 X = train_dataPandas
4 y = train_dataPandas['price']
----> 5 reg_model_Pandas.fit(X, y)
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, n_jobs)
374 n_jobs_ = self.n_jobs
375 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 376 y_numeric=True, multi_output=True)
377
378 X, y, X_mean, y_mean, X_std = self._center_data(
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
442 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
443 ensure_2d, allow_nd, ensure_min_samples,
--> 444 ensure_min_features)
445 if multi_output:
446 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
342 else:
343 dtype = None
--> 344 array = np.array(array, dtype=dtype, order=order, copy=copy)
345 # make sure we actually converted to numeric:
346 if dtype_numeric and array.dtype.kind == "O":
ValueError: invalid literal for float(): 20140610T000000
Output from train_dataPandas.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17290 entries, 4058200630 to 1762600320
Data columns (total 20 columns):
date 17290 non-null object
price 17290 non-null int64
bedrooms 17290 non-null int64
bathrooms 17290 non-null float64
sqft_living 17290 non-null int64
sqft_lot 17290 non-null int64
floors 17290 non-null float64
waterfront 17290 non-null int64
view 17290 non-null int64
condition 17290 non-null int64
grade 17290 non-null int64
sqft_above 17290 non-null int64
sqft_basement 17290 non-null int64
yr_built 17290 non-null int64
yr_renovated 17290 non-null int64
zipcode 17290 non-null int64
lat 17290 non-null float64
long 17290 non-null float64
sqft_living15 17290 non-null int64
sqft_lot15 17290 non-null int64
dtypes: float64(4), int64(15), object(1)
memory usage: 2.8+ MB
Another possible solution, based on your data, is to pass parse_dates when reading the file so that the date column is parsed as a datetime instead of being left as a string:
import pandas
salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])
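This helps because the error comes from the date column holding strings such as 20140610T000000, which cannot be converted to float. Once the column is a datetime, you can break it up into numeric features (month, day, hour) before passing the data to be fitted. This assumes that most of the variation in your data lies in those components and not in the year (i.e. your total number of unique years is about 3-4).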
From here you can use the datetime-like properties of the column: salesPandas['date'].dt.month gives the month, and for day and hour just replace month with day or hour accordingly.
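As a rough sketch of the steps above, the following builds numeric month and day features from the date column and fits the model on them. The five-row sample and the sale_month / sale_day column names are made up for illustration and are not from the original data. The date format is given explicitly here rather than relying on parse_dates, and the price column is dropped from X on the assumption that the target should not also be a feature.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny made-up sample laid out like home_data.csv (values are illustrative only).
csv_text = """id,date,price,bedrooms,sqft_living
1,20140610T000000,300000,3,1500
2,20141209T000000,450000,4,2100
3,20150225T000000,250000,2,1100
4,20150518T000000,520000,4,2600
5,20140827T000000,380000,3,1800
"""

sales = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Parse the date strings explicitly; the format matches values like 20140610T000000.
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")

# Derive numeric features from the datetime column (hypothetical column names).
sales["sale_month"] = sales["date"].dt.month
sales["sale_day"] = sales["date"].dt.day

# The feature matrix must be numeric, so drop the raw date column;
# the target (price) is dropped as well so it is not used to predict itself.
X = sales.drop(columns=["date", "price"])
y = sales["price"]

model = LinearRegression()
model.fit(X, y)

print(X.dtypes)
print(model.coef_.shape)
```

With the real file you would apply the same steps to train_dataPandas instead of the in-memory sample. Whether month and day are actually useful predictors depends on your data.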