python r machine-learning scikit-learn

Is it possible to transform a target variable using `ravel()` or `to_numpy()` in a `sklearn` pipeline?


I am using RStudio and tidymodels in an R Markdown document, and I would like to incorporate some models from scikit-learn. Passing data from the R code chunks to the Python code chunk works well, but when I train and test a model using the following code:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

log_reg_pipe = Pipeline([
  ('Logistic Regression', LogisticRegression())
])

log_reg_pipe.fit(X_train, y_train).score(X_val, y_val)

I get the following warning:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().

I can avoid it by fitting the model on y_train['clinical_course'].to_numpy(), but ideally I would like this conversion to happen directly in the pipeline. Is this possible?

Note that the code above is just a simple example to show my problem. In this case X_train has four columns and y_train has one.

As described above I tried to use .to_numpy(), but I would like a solution that does all the transformations within the pipeline.


Solution

  • I don't think this is possible: sklearn pipelines don't support transforming the target variable. See https://stackoverflow.com/a/62826301/10495893 for some notes about that.

    (There is TransformedTargetRegressor, but that's meant for things like log-transforming the target before fitting a regressor. I don't think there's a way to hack it into working with a classifier.)

    IMO, since much of sklearn expects y to be 1D, that reshaping should happen outside the pipeline. You probably don't need to_numpy: selecting the column as a pandas Series should be enough, and that can be done earlier in your workflow, e.g. y = df['clinical_course'].
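A minimal sketch of that suggestion, assuming y_train arrives as a one-column DataFrame as in the question. The data here is synthetic, and the feature names and the step name `log_reg` are placeholders rather than anything from the original post:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the question's data (assumption): four feature
# columns and a one-column DataFrame holding the target.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(40, 4)), columns=["a", "b", "c", "d"])
y_train = pd.DataFrame({"clinical_course": np.arange(40) % 2})

print(y_train.shape)  # (40, 1): a column vector, which triggers the warning

# Option 1: select the column, which gives a 1D pandas Series.
y_series = y_train["clinical_course"]

# Option 2: flatten the DataFrame's values into a 1D NumPy array.
y_flat = y_train.to_numpy().ravel()

# The pipeline itself is unchanged; only the shape of y differs.
log_reg_pipe = Pipeline([("log_reg", LogisticRegression())])
log_reg_pipe.fit(X_train, y_series)
train_accuracy = log_reg_pipe.score(X_train, y_series)
```

Either option gives y the shape (n_samples,). Selecting the Series keeps the pandas index, while ravel() returns a plain array; which one is more convenient likely depends on how the data comes over from the R chunks.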