I am using RStudio and tidymodels in an R Markdown document, and I would like to incorporate some models from scikit-learn. Passing data from the R code chunks to the Python code chunk works well, but when I train and test a model using the following code:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
log_reg_pipe = Pipeline([
    ('Logistic Regression', LogisticRegression())
])
log_reg_pipe.fit(X_train, y_train).score(X_val, y_val)
I get the following warning:
DataConversionWarning: A column-vector y was passed when a 1d array was expected.
Please change the shape of y to (n_samples, ), for example using ravel().
I can solve it by fitting the model with y_train['clinical_course'].to_numpy(), but I would ideally like this to be done directly in the pipeline. Is this possible?
Note that the code above is just a simple example to show my problem. In this case X_train has four columns and y_train has one.
As described above, I tried using .to_numpy(), but I would like a solution that does all the transformations within the pipeline.
I don't think this is possible: sklearn pipelines don't support transforming the target variable. See https://stackoverflow.com/a/62826301/10495893 for some notes about that.
(There is TransformedTargetRegressor, but that's meant for things like log-transforming the target before fitting a regressor. I don't think there's a way to hack it into working with a classifier.)
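For reference, here is a minimal sketch of what TransformedTargetRegressor is designed for: fitting a regressor on a log-transformed target and mapping predictions back. The synthetic data and variable names are made up for illustration and have nothing to do with the question's dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data in which log(y) is exactly linear in X.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(1.0 + 2.0 * X[:, 0])

# The wrapper applies func to y before fitting the inner regressor and
# applies inverse_func to the inner regressor's predictions.
reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
reg.fit(X, y)
pred = reg.predict(X)  # predictions are back on the original scale of y
```

Note that it wraps a regressor and transforms target values; it is not a general-purpose hook for reshaping y ahead of a classifier.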
IMO, since y is taken to be 1D throughout much of sklearn, that conversion should happen outside pipelines. You probably don't need to_numpy; just slicing to a pandas Series should be enough, and that could be done earlier in your workflow, e.g. y = df['clinical_course'].
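A small sketch of the difference, using made-up data in place of the question's: the feature names a to d and the way the target is generated are assumptions, and only the column name clinical_course comes from the question. Selecting with double brackets gives a one-column DataFrame of shape (n, 1), which should trigger the warning, while single brackets give a Series of shape (n,).

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data: four feature columns and one binary target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["clinical_course"] = (df["a"] + df["b"] > 0).astype(int)

X = df[["a", "b", "c", "d"]]
y_frame = df[["clinical_course"]]   # one-column DataFrame, shape (100, 1)
y_series = df["clinical_course"]    # Series, shape (100,)

log_reg_pipe = Pipeline([("Logistic Regression", LogisticRegression())])


def fit_warns(y):
    """Fit the pipeline and report whether a DataConversionWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        log_reg_pipe.fit(X, y)
    return any(issubclass(w.category, DataConversionWarning) for w in caught)


frame_warns = fit_warns(y_frame)     # column-vector y
series_warns = fit_warns(y_series)   # 1D y
```

If y_train already arrives from R as a one-column data frame, y_train['clinical_course'] (or y_train.squeeze()) does the same job on the Python side before calling fit.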