machine-learning scikit-learn neuraxle

Using fit_params in neuraxle pipeline


I want to use a classifier, e.g. the sklearn.linear_model.SGDClassifier, within a neuraxle pipeline and fit it in an online fashion using partial_fit. I have the classifier wrapped in an SKLearnWrapper with use_partial_fit=True, like this:

from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from sklearn.linear_model import SGDClassifier

p = Pipeline([
    SKLearnWrapper(SGDClassifier(), use_partial_fit=True)
    ]
)

X = [[1.], [2.], [3.]]
y = ['class1', 'class2', 'class1']

p.fit(X, y)

However, to fit the classifier in an online fashion, one needs to pass an additional argument, classes, to the partial_fit function, listing all classes that can occur in the data, e.g. classes=['class1', 'class2'], at least the first time it is called. So the above code results in an error:

ValueError: classes must be passed on the first call to partial_fit.
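This requirement comes from scikit-learn itself: because partial_fit sees the data in batches, the classifier cannot infer the full label set from the first batch alone, so classes must be declared up front. Calling partial_fit directly, outside any pipeline, illustrates this:

```python
from sklearn.linear_model import SGDClassifier

X = [[1.], [2.], [3.]]
y = ['class1', 'class2', 'class1']

clf = SGDClassifier()
# The full set of labels must be declared on the first partial_fit call;
# subsequent calls may omit it.
clf.partial_fit(X, y, classes=['class1', 'class2'])
clf.partial_fit([[4.]], ['class2'])  # no classes argument needed anymore

print(clf.predict([[2.]])[0])
```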

The same issue arises for other fit_params like sample_weight. In a standard sklearn pipeline, fit_params can be handed down to individual steps via the <step name>__<parameter name> syntax, e.g. for the sample_weight parameter:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

q = Pipeline([
    ('clf', SGDClassifier())
])

q.fit(X, y, clf__sample_weight=[0.25, 0.5, 0.25])

Of course, the standard sklearn pipeline does not allow calling partial_fit on the classifier, which is why I want to use the neuraxle pipeline in the first place.

Is there any way to hand additional parameters to the fit or partial_fit functions of a step in a neuraxle pipeline?


Solution

  • I suggest forking the SKLearnWrapper and redefining its fit logic, so that you can pass the missing arguments (such as classes) to partial_fit yourself.

    You could also add a method to this forked SKLearnWrapper as follows. The classes argument can then be set from outside the pipeline later on, using an apply call.

    from typing import List
    
    from neuraxle.steps.sklearn import SKLearnWrapper
    from sklearn.linear_model import SGDClassifier
    
    
    class ConfigurablePartialSGDClassifier(SKLearnWrapper):
    
        def __init__(self):
            super().__init__(SGDClassifier(), use_partial_fit=True)
            self.classes = None  # set later via apply("update_classes", ...)
    
        def update_classes(self, classes: List[str]):
            self.classes = classes
    
        def _sklearn_fit_without_expected_outputs(self, data_inputs):
            self.wrapped_sklearn_predictor.partial_fit(data_inputs, classes=self.classes)
    

    You can then do:

    p = Pipeline([
        ('clf', ConfigurablePartialSGDClassifier())
    ])
    
    X1 = [[1.], [2.], [3.]]
    X2 = [[4.], [5.], [6.]]
    Y1 = [0, 1, 1]
    Y2 = [1, 1, 0]
    classes = [0, 1]  # must contain every label that occurs in Y1 and Y2
    
    p.apply("update_classes", classes)
    p.fit(X1, Y1)
    p.fit(X2, Y2)
    
    

    Note that p could also simply have been defined this way to get the same behavior:

    p = ConfigurablePartialSGDClassifier()
    

    The key point is that apply calls propagate through pipelines: they are forwarded to every nested step, and executed on each step that defines a method with the given name.
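To make that propagation concrete, here is a simplified, hypothetical sketch of how such an apply mechanism can work. This is not neuraxle's actual implementation, just an illustration of the pattern: containers forward the call to their children, and each step only reacts if it defines the named method.

```python
class Step:
    """A minimal step: apply() invokes a named method if the step defines it."""
    def apply(self, method_name, *args):
        method = getattr(self, method_name, None)
        if callable(method):
            method(*args)


class MiniPipeline(Step):
    """A container step that forwards apply() calls to all nested steps."""
    def __init__(self, steps):
        self.steps = steps

    def apply(self, method_name, *args):
        super().apply(method_name, *args)   # apply to self first
        for step in self.steps:
            step.apply(method_name, *args)  # then recurse into children


class ClassAwareStep(Step):
    """A step holding a `classes` attribute that apply() can update."""
    def __init__(self):
        self.classes = None

    def update_classes(self, classes):
        self.classes = classes


# The call reaches the nested step even through an inner pipeline:
inner = MiniPipeline([ClassAwareStep()])
outer = MiniPipeline([inner])
outer.apply("update_classes", [0, 1])
```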