[SOLVED] Sampling a pandas DataFrame and balance data across group

Trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

Given this DataFrame:

import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]}
df = pd.DataFrame(d)

I would like to extract a sample that is balanced across the groups defined by class: that is, for each class, extract the same number of observations.

In particular, I would like the per-class number of observations to equal the sample size of the smallest class.

UPDATE

piRSquared's answer works but:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:

What is passed to the lambda as x in this case?
What is g in g.size()?
Why does the output contain the numbers 6, 5, 4, 1, 8, 9? What do they mean?

Solution

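g = df.groupby('class')
# Sample as many rows from each class as the smallest class has,
# then drop the leftover row labels so the index runs 0..n-1.
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)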

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3
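
As a hedged alternative to the apply call above, pandas 1.1 and later also provide a sample method directly on the groupby object, which draws the same number of rows from every group without a lambda. The sketch below assumes the df from the question; the names n_min and balanced and the random_state seed are illustrative choices, not part of the original answer.

```python
import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]}
df = pd.DataFrame(d)

# Size of the smallest class (c3 has 2 rows in this data).
n_min = int(df['class'].value_counts().min())

# Draw n_min rows from every class. random_state is an arbitrary seed,
# used here only to make the draw repeatable.
balanced = (
    df.groupby('class')
      .sample(n=n_min, random_state=0)
      .reset_index(drop=True)
)
print(balanced)
```

Because every row in a class has the same val in this toy data, the result contains two rows per class regardless of which rows happen to be drawn.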

Answers to your follow-up questions

The x in the lambda is a DataFrame: the subset of df that belongs to one group. Each of these DataFrames, one per group, is passed through the lambda.
g is the groupby object. I put it in a named variable because I use it twice. df.groupby('class').size() is an alternative to df['class'].value_counts(); since I was grouping anyway, reusing the same groupby and calling size() to get the counts saves time.
Those numbers are the index values from df that belong to the sampled rows. I added reset_index(drop=True) to get rid of them.
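
The three points above can be checked by hand. This is a minimal sketch, assuming the df from the question; the variable names sizes, group_rows and sampled are made up for illustration. It looks at g.size(), at the sub-DataFrames that the lambda would receive as x, and at the row labels that survive sampling.

```python
import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]}
df = pd.DataFrame(d)
g = df.groupby('class')

# g.size() is a Series with the number of rows in each class;
# its minimum is the per-class sample size.
sizes = g.size()
n_min = int(sizes.min())
print(sizes)

# Iterating over the groupby yields (key, sub-DataFrame) pairs.
# Each sub-DataFrame is what the lambda receives as x.
group_rows = {key: list(x.index) for key, x in g}
print(group_rows)

# Sampling a sub-DataFrame keeps the original row labels from df,
# which is where numbers such as 8 and 9 in the output come from.
sampled = g.get_group('c3').sample(n_min)
print(sorted(sampled.index))
```

Here c3 has exactly n_min rows, so its sample always carries the labels 8 and 9; for the larger classes the labels vary from run to run unless a random_state is passed to sample.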