I have a DataFrame with a category column. The DataFrame has a different number of rows for each category:
category number_of_rows
cat1 19189
cat2 13193
cat3 4500
cat4 1914
cat5 568
cat6 473
cat7 216
cat8 206
cat9 197
cat10 147
cat11 130
cat12 49
cat13 38
cat14 35
cat15 35
cat16 30
cat17 29
cat18 9
cat19 4
cat20 4
cat21 1
cat22 1
cat23 1
I want to select a different number of rows from each category (instead of a fixed number n of rows from every category).
Example input:
size_1: {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
Another example input:
size_2: {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}
What I want to do is essentially stratified sampling with a given number of instances for each category.
Also, the rows should be randomly selected. For example, for size_1["cat1"] I don't want the top 40 rows, I want 40 random rows.
Thanks for the help.
The basic pattern is to filter the rows of each category, sample them, and concatenate the results (the names here are placeholders):
output = pd.concat(
    [
        df.loc[df[category_column_name] == category_value_1].sample(n_samples_1, random_state=42),
        df.loc[df[category_column_name] == category_value_2].sample(n_samples_2, random_state=42),
    ]
)
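If there are many categories, writing one line per category does not scale. Here is a minimal sketch of the same pattern driven by the sizes dict; it assumes df is the question's DataFrame with a "category" column, and that every requested size is no larger than its group:
import pandas as pd

# Hypothetical sizes dict, shaped like size_1 from the question
sizes = {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11}

# Sample each requested category and concatenate the pieces
output = pd.concat(
    [
        df.loc[df["category"] == cat].sample(n, random_state=42)
        for cat, n in sizes.items()
    ]
)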
Let's first generate some data to see how we can solve the problem:
import pandas as pd

# Define a DataFrame containing employee data
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})
# Number of rows that we want to sample from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}
I can propose three solutions.
Manual concat (sampling each category explicitly)
output = pd.concat(
    [
        df.loc[df["Category"] == "Jai"].sample(1, random_state=42),
        df.loc[df["Category"] == "Princi"].sample(2, random_state=42),
    ]
)
Apply on groupby (one-liner)
output = (df.groupby('Category')
            .apply(lambda group: group.sample(samples_per_group_dict[group.name]))
            .reset_index(drop=True))
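Note that .sample raises a ValueError when the requested size is larger than the group (with the default replace=False), which can happen with the small categories in the question (e.g. cat21 has a single row). A sketch of a guarded variant; the min() cap is my addition, not part of the original one-liner:
output = df.groupby('Category').apply(
    # Cap each request at the number of rows actually present in the group
    lambda group: group.sample(min(samples_per_group_dict[group.name], len(group)))
).reset_index(drop=True)
Alternatively, pass replace=True to .sample if sampling with replacement is acceptable.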
Looping over groups (more verbose, using groupby)
list_of_sampled_groups = []
for name, group in df.groupby('Category'):
    n_rows_to_sample = samples_per_group_dict[name]
    sampled_group = group.sample(n_rows_to_sample)
    list_of_sampled_groups.append(sampled_group)
output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
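As a quick sanity check (my addition, not part of the original answer), you can confirm that the sampled counts match the request:
# Count how many rows were sampled per category
print(output['Category'].value_counts())
# With the toy data above this should show 2 rows for 'Princi' and 1 for 'Jai'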
Performance should be roughly the same across these approaches for data of this size. If performance matters, you can try to vectorize the per-group work, but the best optimization depends on the number of groups and the number of samples in each group.
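If you want to measure this on your own data, here is a minimal timing sketch using the standard timeit module; sample_with_apply and sample_with_loop are hypothetical wrappers around the second and third solutions, and the numbers will depend on how many groups you have and how large they are:
import timeit

def sample_with_apply():
    return df.groupby('Category').apply(
        lambda group: group.sample(samples_per_group_dict[group.name])
    ).reset_index(drop=True)

def sample_with_loop():
    sampled_groups = []
    for name, group in df.groupby('Category'):
        sampled_groups.append(group.sample(samples_per_group_dict[name]))
    return pd.concat(sampled_groups).reset_index(drop=True)

# Time each approach; increase number= for more stable measurements
print("apply:", timeit.timeit(sample_with_apply, number=100))
print("loop: ", timeit.timeit(sample_with_loop, number=100))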