I'm trying to use the yellowbrick PredictionError and am running into strange dimensionality issues. I am using yellowbrick version 1.4.
Suppose we had this very simple linear regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import PredictionError, ResidualsPlot
X = pd.DataFrame({
    "x1": np.linspace(1, 1000, 800),
    "x2": np.linspace(2, 500, 800),
    "x3": np.random.rand(800) * 50
})
y = pd.DataFrame().assign(y_val = 3 * X.x1 + 4 * X.x2 + X.x3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
Now I want to run diagnostics. ResidualsPlot works easily, passing in the Pandas data structures unmodified:
rp = ResidualsPlot(model)
rp.fit(X_train, y_train)
rp.score(X_test, y_test)
rp.show()
# produces graphic (not shown)
However, when I try to use PredictionError:
pe = PredictionError(model)
pe.fit(X_train, y_train)
pe.score(X_test, y_test)
The call to score() produces this error message:
File ~/venv/lib/python3.9/site-packages/yellowbrick/bestfit.py:141, in draw_best_fit(X, y, ax, estimator, **kwargs)
139 # Verify that y is a (n,) dimensional array
140 if y.ndim > 1:
--> 141 raise YellowbrickValueError(
142 "y must be a (1,) dimensional array not {}".format(y.shape)
143 )
145 # Uses the estimator to fit the data and get the model back.
146 model = estimator(X, y)
YellowbrickValueError: y must be a (1,) dimensional array not (264, 1)
Now I realize the type of y is DataFrame. If I change it to a Series, the code works, e.g.:
# Same as before, for reference
y = pd.DataFrame().assign(y_val= 3 * X.x1 + 4 * X.x2 + X.x3)
# Change to Series here
y = y["y_val"]
Converting to a Series is certainly a viable workaround, but I'm wondering why it is needed here and not with ResidualsPlot.
PredictionError calls a draw_best_fit function that checks whether y has only one dimension. ResidualsPlot does not use this function, which is why it accepts the single-column DataFrame. Maybe you can submit a PR suggesting a fix: https://github.com/DistrictDataLabs/yellowbrick/
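To make the dimensionality difference concrete, here is a minimal sketch using only pandas and NumPy. The function check_one_dimensional is a hypothetical stand-in that mirrors the `y.ndim > 1` test visible in the traceback; it is not yellowbrick's actual code. The toy target column is also made up for illustration.

```python
import numpy as np
import pandas as pd


def check_one_dimensional(y):
    """Hypothetical stand-in for the ndim check shown in the traceback."""
    if y.ndim > 1:
        raise ValueError("y must be 1-dimensional, not {}".format(y.shape))
    return True


# A single-column DataFrame target, shaped like the one in the question.
y_frame = pd.DataFrame({"y_val": np.arange(5, dtype=float)})

# Three ways to get a one-dimensional target from it.
y_series = y_frame["y_val"]                     # select the column as a Series
y_squeezed = y_frame.squeeze(axis="columns")    # collapse the single column
y_flat = y_frame.to_numpy().ravel()             # plain 1-D NumPy array

# The DataFrame is two-dimensional even with one column; the others are not.
print(y_frame.ndim, y_frame.shape)
print(y_series.ndim, y_series.shape)
print(y_flat.ndim, y_flat.shape)

# The 2-D target fails the check, the 1-D versions pass.
try:
    check_one_dimensional(y_frame)
    frame_passes = True
except ValueError as err:
    frame_passes = False
    print("rejected:", err)

series_passes = check_one_dimensional(y_series)
flat_passes = check_one_dimensional(y_flat)
```

Any of the three one-dimensional forms should work as the target passed to fit() and score(); selecting the column by name is the most explicit when the DataFrame might later gain more columns.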