python scikit-learn cox-regression scikit-learn-pipeline

How to pass parameters to this sklearn Cox model in a Pipeline?


If I run the following Python code it works well:

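from lifelines import CoxPHFitter
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

target = 'churn'
# One-hot encode every categorical column except the target; pass the rest through
tranOH = ColumnTransformer([
    ('one', OneHotEncoder(drop='first', dtype='int'),
     make_column_selector(dtype_include='category', pattern=f"^(?!{target}).*")),
], remainder='passthrough')

dftrain2 = tranOH.fit_transform(dftrain)
cph = CoxPHFitter(penalizer=0.1)
cph.fit(dftrain2, 'months', 'churn')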

But if I try to do it with a Pipeline I get an error:

from sklearn.pipeline import Pipeline

mcox = Pipeline(steps=[
    ("onehot", tranOH),
    ('modelo', CoxPHFitter(penalizer=0.1)),
])

mcox.fit(dftrain, modelo__duration_col="months", modelo__event_col='churn')

It says:

TypeError                                 Traceback (most recent call last)
Cell In[88], line 6
      1 mcox = Pipeline(steps=[
      2     ("onehot", tranOH),
      3     ('modelo', CoxPHFitter(penalizer=0.1)) 
      4     ])
----> 6 mcox.fit(dftrain, modelo__duration_col="months", modelo__event_col=target)

File ~\AppData\Roaming\Python\Python310\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1466     estimator._validate_params()
   1468 with config_context(
   1469     skip_parameter_validation=(
   1470         prefer_skip_nested_validation or global_skip_validation
   1471     )
   1472 ):
-> 1473     return fit_method(estimator, *args, **kwargs)

File ~\AppData\Roaming\Python\Python310\site-packages\sklearn\pipeline.py:473, in Pipeline.fit(self, X, y, **params)
    471     if self._final_estimator != "passthrough":
    472         last_step_params = routed_params[self.steps[-1][0]]
--> 473         self._final_estimator.fit(Xt, y, **last_step_params["fit"])
    475 return self

File ~\AppData\Roaming\Python\Python310\site-packages\lifelines\utils\__init__.py:56, in CensoringType.right_censoring.<locals>.f(model, *args, **kwargs)
     53 @wraps(function)
     54 def f(model, *args, **kwargs):
     55     cls.set_censoring_type(model, cls.RIGHT)
---> 56     return function(model, *args, **kwargs)

TypeError: CoxPHFitter.fit() got multiple values for argument 'duration_col'

tranOH is a ColumnTransformer that one-hot encodes all categorical columns except 'churn'.

I have also tried passing duration_col="months" and event_col=target directly to CoxPHFitter(), but I get the same error.

Later I want to use the pipeline in a GridSearchCV to fine-tune the penalizer parameter, optimizing the accuracy score for predicting churn at a given time ("months").

I don't have the same problem with other models. For example, if I replace CoxPHFitter with LogisticRegression, it works well.


Solution

  • CoxPHFitter doesn't abide by the sklearn API: its fit method takes df followed by other arguments (duration_col first), not X then y followed by other arguments.

    This is mostly OK, because you can just pass the whole frame (including the target and duration columns) as X, and sklearn pipelines will just take that as an unsupervised setting, i.e. the default value y=None. But, then we get to fitting the last step of the pipeline, and this gets called:

    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
    

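    (source: sklearn's pipeline.py, as shown in your traceback). y here will still be None, but since CoxPHFitter takes its second positional argument as duration_col, this means duration_col gets set twice: first to None (via the positional y argument here) and then to "months" (via your keyword argument). Python rejects that with the "got multiple values for argument 'duration_col'" TypeError.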

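    I don't think there's an easy way to fix this. lifelines provides an sklearn_adapter (in lifelines.utils.sklearn_adapter), but that seems to have its own issues and is slated for removal in version 0.28. I would just keep the model separate from the preprocessing pipeline, i.e. fit tranOH first and then call cph.fit on the transformed frame yourself, as in your first snippet.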
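
The difference between the two call styles can be reproduced without lifelines or sklearn installed. The sketch below is only an illustration: StandInCox is a hypothetical stand-in that copies just the argument order of CoxPHFitter.fit (df first, then duration_col, then event_col), and the string Xt is a placeholder for the transformed frame. The first call mimics what Pipeline.fit does for its last step; the second mimics calling fit yourself after preprocessing.

```python
class StandInCox:
    """Hypothetical stand-in: mimics only the argument order of CoxPHFitter.fit."""

    def fit(self, df, duration_col=None, event_col=None):
        self.duration_col = duration_col
        self.event_col = event_col
        return self


model = StandInCox()
Xt = "transformed frame"  # placeholder for the ColumnTransformer output
y = None                  # Pipeline.fit(X, ...) leaves y at its default

# What Pipeline.fit effectively does for the final step:
# y lands in the duration_col slot, then the keyword tries to fill it again.
try:
    model.fit(Xt, y, duration_col="months", event_col="churn")
    pipeline_error = None
except TypeError as exc:
    pipeline_error = str(exc)

print(pipeline_error)

# Calling fit directly, without the positional y, avoids the collision.
model.fit(Xt, duration_col="months", event_col="churn")
print(model.duration_col, model.event_col)
```

With the real objects, the second call corresponds to cph.fit(tranOH.fit_transform(dftrain), duration_col="months", event_col="churn"), which is the pattern from the first snippet in the question.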