python pandas scikit-learn subsampling

Scikit-learn balanced subsampling


I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called SpreadSubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting but that's not what I'm looking for.)


Solution

  • Here is my first version, which seems to be working fine. Feel free to copy it or suggest how it could be made more efficient (I have a lot of programming experience in general, but not much with Python or NumPy).

    This function creates a single random balanced subsample.

    Edit: subsample_size currently scales the sample relative to the smallest (minority) class, so values below 1 sample the minority class down as well; this should probably be changed.

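    import numpy as np
    
    def balanced_subsample(x, y, subsample_size=1.0):
        # Group the rows of x by class label and track the smallest class size.
        class_xs = []
        min_elems = None
    
        for yi in np.unique(y):
            elems = x[(y == yi)]
            class_xs.append((yi, elems))
            if min_elems is None or elems.shape[0] < min_elems:
                min_elems = elems.shape[0]
    
        # Every class contributes use_elems rows (a fraction of the smallest class).
        use_elems = min_elems
        if subsample_size < 1:
            use_elems = int(min_elems * subsample_size)
    
        xs = []
        ys = []
    
        for ci, this_xs in class_xs:
            if len(this_xs) > use_elems:
                # Shuffles the copy made by boolean indexing, not x itself.
                np.random.shuffle(this_xs)
    
            x_ = this_xs[:use_elems]
            # np.full keeps the label dtype instead of casting labels to float.
            y_ = np.full(use_elems, ci)
    
            xs.append(x_)
            ys.append(y_)
    
        xs = np.concatenate(xs)
        ys = np.concatenate(ys)
    
        return xs, ys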
    

    For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:

    1. Replace the np.random.shuffle line with

      this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

    2. Replace the np.concatenate lines with

      xs = pd.concat(xs)
      ys = pd.Series(data=np.concatenate(ys), name='target')
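
Since the question asks for N subsamples rather than one, here is a rough sketch of one way to get them by sampling row positions instead of copying the data. It is not part of the answer above: the name `balanced_subsample_indices` and its `n_subsamples` and `rng` parameters are made up for illustration, and it assumes every class should contribute as many rows as the smallest class.

```python
import numpy as np

def balanced_subsample_indices(y, n_subsamples, rng=None):
    """Yield index arrays, each holding the same number of rows per class.

    Hypothetical helper: rng may be None, an int seed, or a numpy Generator.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    # Row positions of every class, computed once and reused for each subsample.
    class_indices = [np.flatnonzero(y == label) for label in np.unique(y)]
    # Assumption: each class is sampled down to the size of the smallest class.
    n_per_class = min(len(idx) for idx in class_indices)
    for _ in range(n_subsamples):
        picked = [rng.choice(idx, size=n_per_class, replace=False)
                  for idx in class_indices]
        # Shuffle so that rows are not grouped by class.
        yield rng.permutation(np.concatenate(picked))

# Toy data: 90 rows of class 0 and 10 rows of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(200).reshape(100, 2)

for idx in balanced_subsample_indices(y, n_subsamples=3, rng=0):
    X_sub, y_sub = X[idx], y[idx]  # feed these to one classifier of the ensemble
```

Each subsample is drawn independently, so subsamples may overlap with one another, but no row is repeated within a single subsample. Because only positions are returned, the same indices should also work with a DataFrame through `df.iloc[idx]`.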