I am trying to to random oversampling over a small dataset for linear regression. However it seems the scikit learn sampling API doesnt work with float values as its target variable. Is there anyway to solve this?
This is a sample of my y_train values, which are log transformed.
3.688879 3.828641 3.401197 3.091042 4.624973
from imblearn.over_sampling import RandomOverSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_sample(X_train,y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-53-036424abd2bd> in <module>
1 from imblearn.over_sampling import RandomOverSampler
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
73 The corresponding label of `X_resampled`.
74 """
---> 75 check_classification_targets(y)
76 arrays_transformer = ArraysTransformer(X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
170 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
171 'multilabel-indicator', 'multilabel-sequences']:
--> 172 raise ValueError("Unknown label type: %r" % y_type)
173
174
ValueError: Unknown label type: 'continuous'
Re-sampling strategies are not meant for regression problems. Hence, the RandomOverSampler
will not accept float
type targets. There are approaches to re-sample data with continuous targets though. One example is the reg_resample
which can be used like the following:
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_regression
from reg_resampler import resampler
import numpy as np
# Create some dummy data for demonstration
X, y = make_regression(n_features=10)
df = np.append(X, y.reshape(100, 1), axis=1)
# Initialize the resampler object and generate pseudo-classes
rs = resampler()
y_classes = rs.fit(df, target=10)
# Now resample
X_res, y_res = rs.resample(
sampler_obj=RandomOverSampler(random_state=27),
trainX=df,
trainY=y_classes
)
The resampler
object will generate pseudo-classes based on your target values and then use a classic re-sampling object from the imblearn
package to re-sample your data. Note that the data you pass to the resampler
object should contain all data, including the targets.