I have a pandas dataframe like this:
  ID  Value
0  a      2
1  a      4
2  b      6
3  c      8
4  c     10
5  c     12
I would like to sample equally from the ID groups. I know I can group the data frame by ID and then specify the number of rows I want to sample from each group like this:
df.groupby("ID").sample(n=2, replace=True)
However, I just want the probability of sampling from a group to be the same, not necessarily the exact same number of rows.
If you want to sample N rows with roughly the same probability of drawing from each group, you could oversample per group and then sample again:
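import math

N = 4
# Draw ceil(N / number_of_groups) rows per group (with replacement),
# then keep N of those rows at random.
out = (df.groupby('ID')
         .sample(n=math.ceil(N / df['ID'].nunique()), replace=True)
         .sample(N)
      )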
Example output:
  ID  Value
2  b      6
2  b      6
4  c     10
1  a      4
With N = 10:
  ID  Value
0  a      2
2  b      6
5  c     12
3  c      8
1  a      4
5  c     12
2  b      6
1  a      4
1  a      4
2  b      6
Proportion with N = 100:
ID
b    0.34
a    0.33
c    0.33
Name: proportion, dtype: float64
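For comparison, here is a minimal sketch that wraps the approach above in a function and puts a weighted variant next to it. The function names `sample_balanced` and `sample_weighted`, the `key` and `seed` parameters, and the rebuilt `df` are assumptions made for illustration; they are not part of the original answer. The weighted variant gives every row a weight of 1 / (size of its group), so each group carries the same total weight and every draw picks each group with equal probability. The proportions are checked with `value_counts(normalize=True)`, which is presumably how the table above was produced.

```python
import math
import pandas as pd

# Rebuild the example frame from the question (assumed layout).
df = pd.DataFrame({"ID": list("aabccc"), "Value": [2, 4, 6, 8, 10, 12]})


def sample_balanced(df, n, key="ID", seed=None):
    """Oversample each group equally, then keep n rows at random."""
    per_group = math.ceil(n / df[key].nunique())
    return (df.groupby(key)
              .sample(n=per_group, replace=True, random_state=seed)
              .sample(n, random_state=seed))


def sample_weighted(df, n, key="ID", seed=None):
    """Draw n rows with replacement; each group has equal total weight."""
    # Row weight = 1 / size of the row's group. pandas normalizes the
    # weights to sum to 1, so every group ends up with the same share.
    weights = 1 / df[key].map(df[key].value_counts())
    return df.sample(n=n, replace=True, weights=weights, random_state=seed)


# Compare the group proportions of both approaches for N = 100.
print(sample_balanced(df, 100)["ID"].value_counts(normalize=True))
print(sample_weighted(df, 100)["ID"].value_counts(normalize=True))
```

The two behave differently in one respect: `sample_balanced` keeps the group counts almost exactly equal (they can differ only by the few rows dropped in the second step), whereas `sample_weighted` makes independent draws, so the counts fluctuate around one third each from run to run.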