python numpy scikit-learn subsampling

Random subsampling of the majority class


I have an unbalanced dataset and I want to perform random subsampling of the majority class, where each subsample is the same size as the minority class. I think this is already implemented in Weka and Matlab. Is there an equivalent in sklearn?


Solution

  • Say your data looks like something generated from this code:

    import numpy as np
    
    x = np.random.randn(100, 3)
    y = np.array([int(i % 5 == 0) for i in range(100)])
    

    (Only one fifth of y is 1, so 1 is the minority class.)

    To find the size of the minority class, do:

    >>> np.sum(y == 1)
    20
    

    To find the subset that consists of the majority class, do:

    majority_x, majority_y = x[y == 0, :], y[y == 0]
    

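    To find a random subset of the majority class of size 20 (the minority-class size), draw row indices without replacement. np.random.choice samples with replacement by default, which could pick the same row more than once, so pass replace=False:

    inds = np.random.choice(majority_x.shape[0], 20, replace=False)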
    

    The subsampled majority features are then

    majority_x[inds, :]
    

    and the matching labels are

    majority_y[inds]
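
Putting the steps above together, here is a minimal sketch of a helper that returns a balanced dataset. The function name `undersample_majority` and its `majority_label` and `rng` parameters are my own choices for illustration, not part of NumPy or sklearn, and it assumes a binary problem where every label other than `majority_label` belongs to the minority class.

```python
import numpy as np


def undersample_majority(x, y, majority_label=0, rng=None):
    """Randomly cut the majority class down to the size of the minority class.

    Hypothetical helper for illustration; assumes a binary problem where
    every label other than `majority_label` is the minority class.
    """
    rng = np.random.default_rng(rng)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Draw without replacement so no majority row appears twice.
    keep = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    # Combine with all minority rows and shuffle the row order.
    idx = np.concatenate([keep, minority_idx])
    rng.shuffle(idx)
    return x[idx], y[idx]


# Same toy data as above: 100 rows, one fifth of them labelled 1.
x = np.random.default_rng(0).standard_normal((100, 3))
y = np.array([int(i % 5 == 0) for i in range(100)])

x_bal, y_bal = undersample_majority(x, y, majority_label=0, rng=0)
```

Calling the function again with a different `rng` seed gives another independent subsample of the same size, which is handy if you want to train on several balanced subsets.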