Tags: python, scikit-learn, linear-regression, logarithm, oversampling

RandomOverSampler doesn't seem to accept log transform as my y target variable


I am trying to do random oversampling on a small dataset for linear regression. However, it seems the scikit-learn sampling API doesn't work with float values as the target variable. Is there any way to solve this?

This is a sample of my y_train values, which are log transformed.

3.688879 3.828641 3.401197 3.091042 4.624973

from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_sample(X_train,y_train)
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
      1 from imblearn.over_sampling import RandomOverSampler

~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     73             The corresponding label of `X_resampled`.
     74         """
---> 75         check_classification_targets(y)
     76         arrays_transformer = ArraysTransformer(X, y)
     77         X, y, binarize_y = self._check_X_y(X, y)

~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171                       'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'

Solution

  • The re-sampling strategies in imblearn are meant for classification problems, not regression. They validate y with check_classification_targets, which is why RandomOverSampler rejects a continuous (float) target with the error shown above. There are approaches to re-sampling data with continuous targets, though. One example is the reg_resampler package, which can be used as follows:

    from imblearn.over_sampling import RandomOverSampler
    from sklearn.datasets import make_regression
    from reg_resampler import resampler
    import numpy as np
    
    
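    # Create some dummy data for demonstration (make_regression returns 100 samples by default)
    X, y = make_regression(n_features=10)
    # Append the target as the last column (index 10) so features and target form one array
    df = np.append(X, y.reshape(-1, 1), axis=1)
    
    # Initialize the resampler object and generate pseudo-classes;
    # target=10 is the column index of the target in df
    rs = resampler()
    y_classes = rs.fit(df, target=10)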
    
    # Now resample
    X_res, y_res = rs.resample(
        sampler_obj=RandomOverSampler(random_state=27),
        trainX=df,
        trainY=y_classes
    )
    

    The resampler object generates pseudo-classes based on your target values and then uses a standard re-sampling object from the imblearn package to re-sample your data against those pseudo-classes. Note that the array you pass to the resampler object must contain all columns, including the target.
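
To make the pseudo-class idea concrete, here is a minimal sketch that uses only NumPy. It is not the reg_resampler implementation: the helper name `oversample_by_target_bins`, the equal-width binning, and the choice of three bins are all assumptions made for illustration.

```python
import numpy as np


def oversample_by_target_bins(X, y, n_bins=3, random_state=0):
    """Randomly oversample rows until every occupied target bin has the same size.

    Hypothetical helper for illustration; not part of imblearn or reg_resampler.
    """
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(random_state)

    # Interior edges of n_bins equal-width bins spanning the range of y.
    edges = np.linspace(y.min(), y.max(), n_bins + 1)[1:-1]
    # Integer pseudo-class per sample, in 0 .. n_bins - 1.
    pseudo_classes = np.digitize(y, edges)

    labels, counts = np.unique(pseudo_classes, return_counts=True)
    target_count = counts.max()

    # Keep every original row, then add duplicates for the smaller bins.
    selected = [np.arange(len(y))]
    for label, count in zip(labels, counts):
        members = np.flatnonzero(pseudo_classes == label)
        # Draw the shortfall from this bin, with replacement.
        extra = rng.choice(members, size=target_count - count, replace=True)
        selected.append(extra)

    idx = np.concatenate(selected)
    return X[idx], y[idx], pseudo_classes[idx]


# Dummy data with a skewed continuous target (assumption: stands in for y_train).
demo_rng = np.random.default_rng(42)
X = demo_rng.normal(size=(200, 4))
y = demo_rng.exponential(size=200)

X_res, y_res, classes_res = oversample_by_target_bins(X, y, n_bins=3)
print(np.unique(classes_res, return_counts=True))
```

The original rows are always kept and only duplicates are added, so the continuous targets themselves are never altered. The same approach applies to a log-transformed target; the bins are then simply computed on the log scale.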