If I run the following Python code it works well:
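from lifelines import CoxPHFitter
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

target = 'churn'
# One-hot encode every categorical column except the target
tranOH = ColumnTransformer([
    ('one', OneHotEncoder(drop='first', dtype='int'),
     make_column_selector(dtype_include='category', pattern=f"^(?!{target}).*"))
], remainder='passthrough')
dftrain2 = tranOH.fit_transform(dftrain)
cph = CoxPHFitter(penalizer=0.1)
cph.fit(dftrain2, 'months', 'churn')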
But if I try to do it with a Pipeline I get an error:
from sklearn.pipeline import Pipeline

mcox = Pipeline(steps=[
    ("onehot", tranOH),
    ('modelo', CoxPHFitter(penalizer=0.1))
])
mcox.fit(dftrain, modelo__duration_col="months", modelo__event_col='churn')
It says:
TypeError Traceback (most recent call last)
Cell In[88], line 6
1 mcox = Pipeline(steps=[
2 ("onehot", tranOH),
3 ('modelo', CoxPHFitter(penalizer=0.1))
4 ])
----> 6 mcox.fit(dftrain, modelo__duration_col="months", modelo__event_col=target)
File ~\AppData\Roaming\Python\Python310\site-packages\sklearn\base.py:1473, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1466 estimator._validate_params()
1468 with config_context(
1469 skip_parameter_validation=(
1470 prefer_skip_nested_validation or global_skip_validation
1471 )
1472 ):
-> 1473 return fit_method(estimator, *args, **kwargs)
File ~\AppData\Roaming\Python\Python310\site-packages\sklearn\pipeline.py:473, in Pipeline.fit(self, X, y, **params)
471 if self._final_estimator != "passthrough":
472 last_step_params = routed_params[self.steps[-1][0]]
--> 473 self._final_estimator.fit(Xt, y, **last_step_params["fit"])
475 return self
File ~\AppData\Roaming\Python\Python310\site-packages\lifelines\utils\__init__.py:56, in CensoringType.right_censoring.<locals>.f(model, *args, **kwargs)
53 @wraps(function)
54 def f(model, *args, **kwargs):
55 cls.set_censoring_type(model, cls.RIGHT)
---> 56 return function(model, *args, **kwargs)
TypeError: CoxPHFitter.fit() got multiple values for argument 'duration_col'
tranOH is a ColumnTransformer that one-hot encodes all categorical columns except 'churn'.
I have also tried using col="months" and event_col=target directly inside CoxPHFitter(), but I get the same error.
Later I want to use it in a GridSearchCV to fine-tune the penalizer parameter, optimizing the accuracy score for predicting churn at a given time ("months").
I don't have the same problem with other models. For example, if I replace CoxPHFitter with LogisticRegression, it works well.
CoxPHFitter doesn't abide by the sklearn API: its fit method takes df followed by other arguments (duration_col first), not X then y followed by other arguments.
This is mostly OK, because you can just pass the whole frame (including the target and duration columns) as X, and sklearn pipelines will treat that as an unsupervised setting, i.e. with the default value y=None. But then we get to fitting the last step of the pipeline, and this gets called:
self._final_estimator.fit(Xt, y, **last_step_params["fit"])
(source). Here y will still be None, but since CoxPHFitter.fit takes its second positional argument as duration_col, duration_col gets set twice: once to None (via the y argument here) and once to "months" (your keyword argument). Python rejects that, hence the TypeError.
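The collision can be reproduced without sklearn or lifelines. The sketch below is only an illustration: FrameFirstFitter and call_like_pipeline are hypothetical stand-ins that copy the argument order described above (df, then duration_col, then event_col) and the fit(Xt, y, **params) call shown in the traceback; they are not the real classes.

```python
class FrameFirstFitter:
    """Hypothetical stand-in: mimics only the argument order of CoxPHFitter.fit."""

    def fit(self, df, duration_col=None, event_col=None):
        self.duration_col = duration_col
        self.event_col = event_col
        return self


def call_like_pipeline(estimator, Xt, y=None, **fit_params):
    # Same shape as the call in the traceback:
    # self._final_estimator.fit(Xt, y, **last_step_params["fit"])
    return estimator.fit(Xt, y, **fit_params)


try:
    # y (None) lands in the duration_col slot positionally,
    # and duration_col="months" then arrives again as a keyword.
    call_like_pipeline(
        FrameFirstFitter(), "frame", duration_col="months", event_col="churn"
    )
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Calling the stand-in directly with keywords only, as in FrameFirstFitter().fit("frame", duration_col="months", event_col="churn"), raises nothing, which matches the behaviour of the code outside the pipeline.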
I don't think there's an easy way to fix this. lifelines provides an sklearn_adapter, but that seems to have its own issues and is slated for removal in version 0.28. I would just keep the model separate from the preprocessing pipeline.