python pandas scikit-learn yellowbrick

Yellowbrick: PredictionError dimensionality issue


I'm trying to use the yellowbrick PredictionError and am running into strange dimensionality issues. I am using yellowbrick version 1.4.

Suppose we had this very simple linear regression:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from yellowbrick.regressor import PredictionError, ResidualsPlot

X = pd.DataFrame({
    "x1": np.linspace(1, 1000, 800),
    "x2": np.linspace(2, 500, 800),
    "x3": np.random.rand(800) * 50
})
y = pd.DataFrame().assign(y_val = 3 * X.x1 + 4 * X.x2 + X.x3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

Now I want to run diagnostics. ResidualsPlot works easily, passing in the Pandas data structures unmodified:

rp = ResidualsPlot(model)
rp.fit(X_train, y_train)
rp.score(X_test, y_test)
rp.show()
# produces graphic (not shown)

However, when I try to use PredictionError:

pe = PredictionError(model)
pe.fit(X_train, y_train)
pe.score(X_test, y_test)

The call to score() produces this error message:

File ~/venv/lib/python3.9/site-packages/yellowbrick/bestfit.py:141, in draw_best_fit(X, y, ax, estimator, **kwargs)
    139 # Verify that y is a (n,) dimensional array
    140 if y.ndim > 1:
--> 141     raise YellowbrickValueError(
    142         "y must be a (1,) dimensional array not {}".format(y.shape)
    143     )
    145 # Uses the estimator to fit the data and get the model back.
    146 model = estimator(X, y)

YellowbrickValueError: y must be a (1,) dimensional array not (264, 1)

Now I realize the type of y is DataFrame. If I change it to Series, the code will work, e.g.:

# Same as before, for reference
y = pd.DataFrame().assign(y_val= 3 * X.x1 + 4 * X.x2 + X.x3)

# Change to Series here
y = y["y_val"] 

Converting to a Series is certainly a viable workaround, but I'm wondering why it is necessary for PredictionError and not for ResidualsPlot.
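For reference, the difference between the two forms of y can be seen by inspecting ndim and shape. This is a minimal sketch using a small stand-in frame (the 5-row data is illustrative, not the data above); squeeze and to_numpy().ravel() are shown as alternative ways to get a one-dimensional target.

```python
import numpy as np
import pandas as pd

# Small stand-in for the target above (illustrative values only)
y_frame = pd.DataFrame({"y_val": np.arange(5, dtype=float)})

# A single-column DataFrame is still two-dimensional
print(y_frame.ndim, y_frame.shape)

# Selecting the column gives a one-dimensional Series
y_series = y_frame["y_val"]
print(y_series.ndim, y_series.shape)

# Alternatives that also produce a one-dimensional target
y_squeezed = y_frame.squeeze("columns")   # Series
y_array = y_frame.to_numpy().ravel()      # NumPy array
print(y_squeezed.ndim, y_array.shape)
```

Any of these one-dimensional forms can be created before train_test_split, so that both the training and the test targets are one-dimensional.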


Solution

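  • PredictionError calls a helper function, draw_best_fit (the one in yellowbrick/bestfit.py shown in your traceback), which raises an error if the y it receives has more than one dimension. ResidualsPlot does not use this function, so it accepts the single-column DataFrame without complaint. You could submit a PR suggesting a fix: https://github.com/DistrictDataLabs/yellowbrick/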
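The check can be reproduced in isolation. The sketch below is a hypothetical stand-in for the guard quoted in the traceback, not yellowbrick's own code: the function name require_1d is made up for illustration, and a plain ValueError is raised instead of YellowbrickValueError so that the example does not depend on yellowbrick.

```python
import numpy as np

def require_1d(y):
    """Hypothetical stand-in for the dimensionality guard in the traceback."""
    y = np.asarray(y)
    if y.ndim > 1:
        # yellowbrick raises YellowbrickValueError at this point; ValueError
        # is used here only to keep the sketch free of that dependency
        raise ValueError(
            "y must be a (1,) dimensional array not {}".format(y.shape)
        )
    return y

two_d = np.zeros((264, 1))   # same shape as reported in the error message
one_d = two_d.ravel()        # flattened to shape (264,)

try:
    require_1d(two_d)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)

# The flattened array passes the guard unchanged
print(rejected, require_1d(one_d).shape)
```

An array of shape (264, 1) is rejected while the flattened (264,) version passes, which matches the behaviour you see when y is switched from a DataFrame to a Series.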