I'm trying to implement linear regression on the California housing dataset, and I'm reading data as below:
data = pd.read_csv(r'C:\Users\California_Houses.csv',header=None)
print(data.shape)
output: (20640, 14)
print(data.head())
output:
0 1 2 3 4 \
0 Median_House_Value Median_Income Median_Age Tot_Rooms Tot_Bedrooms
1 452600 8.3252 41 880 129
2 358500 8.3014 21 7099 1106
3 352100 7.2574 52 1467 190
4 341300 5.6431 52 1274 235
5 6 7 8 9 \
0 Population Households Latitude Longitude Distance_to_coast
1 322 126 37.88 -122.23 9263.04077285038
2 2401 1138 37.86 -122.22 10225.7330715424
3 496 177 37.85 -122.24 8259.08510932293
4 558 219 37.85 -122.25 7768.0865708364
10 11 12 \
0 Distance_to_LA Distance_to_SanDiego Distance_to_SanJose
1 556529.1583418 735501.80698384 67432.5170008434
2 554279.850068765 733236.884360166 65049.9085739663
3 554610.717069378 733525.68293736 64867.2898334847
4 555194.266086292 734095.290744033 65287.1384120522
13
0 Distance_to_SanFrancisco
1 21250.2137667799
2 20880.6003997074
3 18811.4874496884
4 18031.0475677266
Since the column names are coming in as the first row of data, I tried to remove that row as below:
data = data.iloc[1:,:]
Then I try to convert it to a NumPy ndarray and reshape it:
x = np.array(data.iloc[1:,1:]).reshape(data.shape[0],data.shape[1]-1)
ValueError Traceback (most recent call last)
Input In [16], in <cell line: 10>()
8 print(data.all())
9 #y = np.array(data.iloc[1:,0]).reshape(data.shape[0],1)
---> 10 x = np.array(data.iloc[1:,1:]).reshape(data.shape[0],data.shape[1]-1)
11 #y = np.array(data.iloc[1:,0]).reshape(data.shape[0],1)
12 print(x.shape)
ValueError:
cannot reshape array of size 268320 into shape (20641,13)
I'm getting the error above. Please help.
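The slice data.iloc[1:,1:] drops a row that data.shape[0] still counts, so the array holds 268320 = 20640 x 13 values while the requested shape needs 20641 x 13 = 268333. Keep all rows in the slice so the row counts match. Can you try below: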
x = np.array(data.iloc[:,1:]).reshape(data.shape[0],data.shape[1]-1)
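To illustrate where the mismatch comes from, here is a minimal sketch that uses a tiny inline CSV instead of the real file (three rows and three columns copied from the output in the question; the variable names raw, csv_text and reshape_failed are just for this example). It also shows one possible alternative, assuming the first line of the file really is the header: let read_csv parse it (the default header=0), so no manual row removal is needed and the columns come back numeric rather than as strings.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for California_Houses.csv (first rows/columns from the question).
csv_text = """Median_House_Value,Median_Income,Median_Age
452600,8.3252,41
358500,8.3014,21
352100,7.2574,52
"""

# Reading with header=None keeps the column names as row 0 of the data.
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)  # (4, 3)

# iloc[1:, 1:] drops one row, but raw.shape[0] still counts it,
# so the requested shape has one row too many and reshape raises.
try:
    np.array(raw.iloc[1:, 1:]).reshape(raw.shape[0], raw.shape[1] - 1)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)

# Letting pandas parse the header row (default header=0) avoids the manual
# slicing and gives numeric columns instead of strings.
data = pd.read_csv(io.StringIO(csv_text))
y = data.iloc[:, 0].to_numpy().reshape(-1, 1)  # target: Median_House_Value
x = data.iloc[:, 1:].to_numpy()                # features: remaining columns
print(x.shape, y.shape)
```

Note that the slice of a DataFrame is already two-dimensional, so the extra reshape on x is not needed; only the single-column target has to be reshaped if a column vector is wanted.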