I am trying to figure out how to use pandas.DataFrame.sample, or any other function, to balance this data.
Given this DataFrame:
import pandas as pd

d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }
df = pd.DataFrame(d)
I would like to extract a sample that is balanced across the groups defined by class; that is, for each class, the sample should contain the same number of observations.
In particular, I would like the per-class number of observations to equal the sample size of the smallest class.
piRSquared's answer works, but:
How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:
What is passed to the lambda as x in this case?
What is g in g.size()?

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3
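
For a repeatable run, here is a minimal sketch of the same balancing idea. It assumes pandas 1.1 or newer, where groupby objects have their own sample method, so it is a swapped-in alternative to the apply/lambda call above rather than the code from the answer. The names n and balanced and the seed random_state=0 are arbitrary choices made for illustration.

```python
import pandas as pd

d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]}
df = pd.DataFrame(d)

g = df.groupby('class')

# Size of the smallest class; every class will be cut down to this many rows.
n = int(g.size().min())

# Draw n rows from each class. random_state is an arbitrary seed (an assumption
# added here) so that repeated runs pick the same rows.
balanced = g.sample(n=n, random_state=0).reset_index(drop=True)

# Each class should now appear exactly n times.
print(balanced['class'].value_counts().sort_index())
```

Because every row within a class has the same val in this toy data, the balanced frame has the same contents whichever rows are drawn; with real data the seed decides which rows survive.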
Answers to your follow-up questions
x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.

g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts... saves time.

The index labels on the sampled rows are the index values from df that go with the sampling. I added reset_index(drop=True) to get rid of them.
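
The three points above can be checked directly. This is a small sketch, not part of the original answer. The names sizes, counts and sampled are made up for illustration, and random_state=0 is an arbitrary seed.

```python
import pandas as pd

d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]}
df = pd.DataFrame(d)
g = df.groupby('class')

# Iterating over the groupby yields (group label, sub-DataFrame) pairs.
# Each sub-DataFrame is what the lambda receives as x.
for name, sub in g:
    print(name, sub.index.tolist())

# g.size() and value_counts() report the same per-class counts.
sizes = g.size()
counts = df['class'].value_counts()
print(sizes.to_dict() == counts.to_dict())

# Sampling a group keeps the original row labels from df, which is why
# reset_index(drop=True) is needed to get a clean 0..n-1 index.
sampled = g.get_group('c3').sample(2, random_state=0)
print(sorted(sampled.index))
```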