
Sampling a pandas DataFrame to balance data across groups


I'm trying to figure out how to use pandas.DataFrame.sample, or any other function, to balance this data.

Given this DataFrame:

import pandas as pd

d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
}
df = pd.DataFrame(d)

I would like to extract a sample that is balanced across the class column: that is, group by class and draw the same number of observations from each class.

In particular, I would like the per-class number of observations to equal the sample size of the smallest class.
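
To make the target concrete, here is a small sketch (not part of the original question) that counts the rows per class in the DataFrame above and reads off the smallest count, which is the per-class sample size being asked for. The names counts and n_min are illustrative only.

```python
import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]}
df = pd.DataFrame(d)

# Rows per class: c1 has 5, c2 has 3, c3 has 2
counts = df['class'].value_counts()

# The smallest class determines the per-class sample size
n_min = int(counts.min())
print(n_min)  # 2
```

So a balanced sample of this data would contain 2 rows from each of c1, c2 and c3.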

UPDATE

piRSquared's answer works but:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:


Solution

  • g = df.groupby('class')
    # Sample as many rows from each group as the smallest group has, then flatten the index
    g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)
    
      class  val
    0    c1    1
    1    c1    1
    2    c2    2
    3    c2    2
    4    c3    3
    5    c3    3
    

    Answers to your follow-up questions

    1. The x in the lambda is a DataFrame: the subset of df that belongs to one group. apply passes each of these DataFrames, one per group, through the lambda.
    2. g is the groupby object. I stored it in a named variable because I planned to use it twice. df.groupby('class').size() is an alternative to df['class'].value_counts(), but since I was grouping anyway, I reused the same groupby object and called size() on it to get the counts, which saves time.
    3. Those numbers are the index values from df that belong to the sampled rows. I added reset_index(drop=True) to get rid of them.
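
As a rough illustration of the three points above, here is a minimal sketch. It is not the code from the answer: it iterates over the groups explicitly instead of calling apply, so that each step is visible, and random_state=0 is an assumption added only to make the run repeatable. The names group_shapes, parts, balanced and flat are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby('class')

# 1. Each group is itself a DataFrame; iterating over g shows what the
#    lambda receives as x, one sub-DataFrame per class.
group_shapes = {name: sub.shape for name, sub in g}

# 2. g.size() gives the row count per class, the same numbers as
#    df['class'].value_counts().
sizes = g.size()
n_min = int(sizes.min())

# 3. Sampling keeps the original row labels of df; reset_index(drop=True)
#    replaces them with 0..n-1.
parts = [sub.sample(n_min, random_state=0) for _, sub in g]
balanced = pd.concat(parts)             # index holds labels from df
flat = balanced.reset_index(drop=True)  # index is 0, 1, ..., 5

print(group_shapes)
print(sizes)
print(flat)
```

Printing balanced before the reset shows the leftover labels from df that point 3 refers to; flat has the clean 0 to 5 index with two rows per class.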