[SOLVED] random subsampling of the majority class

I have an unbalanced dataset and I want to randomly subsample the majority class so that each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab; is there an equivalent in sklearn?

Solution

Say your data looks like something generated from this code:

import numpy as np

x = np.random.randn(100, 3)
y = np.array([int(i % 5 == 0) for i in range(100)])

(Only one fifth of y is 1, which makes 1 the minority class.)

To find the size of the minority class, do:

>>> np.sum(y == 1)
20
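If you do not know in advance which label is the minority, a small sketch like the following can find it from the label counts. The names minority_label and minority_size are only illustrative, and it assumes y is a 1-D array of class labels as in the example above.

```python
import numpy as np

y = np.array([int(i % 5 == 0) for i in range(100)])

# Count how often each distinct label occurs in y.
labels, counts = np.unique(y, return_counts=True)

# The minority class is the label with the smallest count.
minority_label = labels[np.argmin(counts)]
minority_size = counts.min()
```

This avoids hard-coding both the label 1 and the size 20 in the steps that follow.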

To find the subset that consists of the majority class, do:

majority_x, majority_y = x[y == 0, :], y[y == 0]

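To draw a random subset of size 20, do:

inds = np.random.choice(majority_x.shape[0], 20, replace=False)

(np.random.choice samples with replacement by default, so replace=False is needed to keep the same row from being picked twice.)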

Then index both arrays with inds to get the subsample:

majority_x[inds, :]
majority_y[inds]
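Putting the steps together, here is a minimal sketch of a helper that returns a balanced dataset. The function name undersample_majority is hypothetical, and it uses np.random.default_rng rather than the legacy np.random.choice calls above so that the result is reproducible from a seed. It assumes x and y are NumPy arrays with one row of x per entry of y.

```python
import numpy as np

def undersample_majority(x, y, rng):
    """Return a subsample of (x, y) in which every class has as many
    rows as the smallest class. Rows are drawn without replacement."""
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()

    keep = []
    for label in labels:
        # Row indices belonging to this class.
        idx = np.flatnonzero(y == label)
        # Draw n_min of them without replacement; for the minority
        # class this simply keeps every row.
        keep.append(rng.choice(idx, size=n_min, replace=False))

    # Sort so the kept rows stay in their original order.
    keep = np.sort(np.concatenate(keep))
    return x[keep], y[keep]

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))
y = np.array([int(i % 5 == 0) for i in range(100)])

x_bal, y_bal = undersample_majority(x, y, rng)
```

Calling the function repeatedly with the same rng gives a different majority subsample each time, which is one way to get several subsamples of the same size. As far as I know, scikit-learn itself has no dedicated undersampler; the separate imbalanced-learn package provides a RandomUnderSampler class for this.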