I am learning scikit-learn, specifically polynomial model fitting.
After using PolynomialFeatures to expand my data to a 2nd-degree polynomial, I do not understand what input dimensions LinearRegression() expects.
Here is the error message I keep getting:
ValueError: X has 4 features, but LinearRegression is expecting 14 features as input.
Here is the code to replicate:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Create dataframes
Dum_data = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
Dum_data_y = pd.DataFrame([[13], [14], [15]])
# Expand to 2nd-degree polynomial features
poly_fit = PolynomialFeatures(degree=2, include_bias=False)
Dum_poly = poly_fit.fit_transform(Dum_data)
print(Dum_data.shape, Dum_data_y.shape)
# Fit the linear model to the expanded features
modl = LinearRegression()
modl.fit(Dum_poly, Dum_data_y)
# Now get the predictions (this line raises the ValueError)
Dum_y_pred = modl.predict(Dum_data)
I have seen a similar issue that was resolved by converting to a NumPy array and reshaping, but the guides I am following (on polynomial regression using scikit-learn and on multivariate regression with Python) seem to pass in dataframes directly. I suspect I need the .reshape() function in some capacity, but after trying different data dimensions, I cannot tell how to determine how many features are expected. Thanks!
You can modify the final line of the code as follows:
Dum_y_pred = modl.predict(Dum_poly)
The original data contains 4 features: x1, x2, x3, and x4.
When you apply PolynomialFeatures with degree=2 (and include_bias=False), it adds 10 more features, the squares and pairwise products: x1x1, x1x2, x1x3, x1x4, x2x2, x2x3, x2x4, x3x3, x3x4, and x4x4.
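In total, this results in 14 features to train your model. Therefore, your model can only accept data with these 14 features, and anything you pass to predict has to go through the same fitted PolynomialFeatures transformation first. No .reshape() is needed.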
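To avoid having to remember the transform step, one option (not something your guides necessarily use) is to chain both steps in a pipeline. This is only a sketch; X, y, X_new and model are illustrative names, and the new sample is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
y = np.array([13, 14, 15])

# The pipeline expands the features and then fits the regression,
# so fit and predict both take the raw 4-column data.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X, y)

preds = model.predict(X)  # the 4 -> 14 expansion happens internally

# Unseen data is handled the same way (made-up sample)
X_new = np.array([[2, 3, 4, 5]])
new_pred = model.predict(X_new)
print(preds.shape, new_pred.shape)
```

Without a pipeline, the equivalent for unseen data is modl.predict(poly_fit.transform(new_data)), using transform rather than fit_transform so the already fitted expansion is reused.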