I'm working with a dataset of 2,500 samples, and I need to extract a random subset of 666 samples while meeting specific conditions based on two boolean columns.
The dataset (df) contains the following columns:
ID
cond_1 (boolean: True/False)
cond_2 (boolean: True/False)
When sampling the subset, I need to ensure these conditions are satisfied:
True values in cond_1 should be exactly 181.
True values in cond_2 should be exactly 181.
False values in cond_1 should be exactly 485.
False values in cond_2 should be exactly 485.
Additionally, cond_1 and cond_2 do not have to match within a row: rows where both are True or both are False are allowed, and so are True/False and False/True combinations.
Is there a recommended way to achieve this type of constrained sampling in Python?
Any code examples or libraries that could help would be greatly appreciated.
Thanks in advance.
import pandas as pd
import numpy as np
# Example data: 2,500 rows with independent random booleans
df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': np.random.choice([True, False], size=2500),
    'cond_2': np.random.choice([True, False], size=2500)
})
# Create the groups based on conditions
group_cond1_true = df[df['cond_1'] == True]
group_cond1_false = df[df['cond_1'] == False]
group_cond2_true = df[df['cond_2'] == True]
group_cond2_false = df[df['cond_2'] == False]
# Sample from each group to meet constraints
sample_cond1_true = group_cond1_true.sample(n=181, random_state=42)
sample_cond1_false = group_cond1_false.sample(n=485, random_state=42)
sample_cond2_true = group_cond2_true.sample(n=181, random_state=42)
sample_cond2_false = group_cond2_false.sample(n=485, random_state=42)
# Combine the samples to create the final dataset
df_extracted = pd.concat([sample_cond1_true, sample_cond1_false, sample_cond2_true, sample_cond2_false]).drop_duplicates()
# Check if the conditions are met
print(f"Total length: {len(df_extracted)}")
print(f"cond_1 True count: {df_extracted['cond_1'].sum()}")
print(f"cond_2 True count: {df_extracted['cond_2'].sum()}")
print(f"cond_1 False count: {(~df_extracted['cond_1']).sum()}")
print(f"cond_2 False count: {(~df_extracted['cond_2']).sum()}")
print()
count = df_extracted.groupby(['cond_1', 'cond_2']).size()
print("Extracted counts:\n", count)
The code above generated df_extracted with 1148 rows instead of the required 666:
Total length: 1148
cond_1 True count: 454
cond_2 True count: 460
cond_1 False count: 694
cond_2 False count: 688
Extracted counts:
cond_1 cond_2
False False 396
True 298
True False 292
True 162
dtype: int64
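I suspect the inflation happens because the four samples are drawn independently and only partially overlap, so drop_duplicates removes just the rows that were picked by both a cond_1 sample and a cond_2 sample. A quick check of the union of the sampled indices (using the samples from the code above) illustrates this:
# 181 + 485 + 181 + 485 = 1332 rows are drawn in total; deduplication only
# removes the rows that appear in more than one of the four samples
idx_union = (set(sample_cond1_true.index) | set(sample_cond1_false.index) |
             set(sample_cond2_true.index) | set(sample_cond2_false.index))
print(len(idx_union))  # 1148 in the run above, matching len(df_extracted)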
The idea is to first build a random template of (cond_1, cond_2) values with exactly the required counts, and then pick rows of df that realize that template.
First, create a DataFrame from randomized arrays with a fixed number of True and False values, and give it a cumulative count (cc) within each (cond_1, cond_2) group.
import numpy as np
import pandas as pd
# Create a True/False array
n_true, n_false = 181, 485
values = np.array([True] * n_true + [False] * n_false)
# Randomly permute the array twice to create two columns in the DataFrame
df_sample = pd.DataFrame({
    'cond_1': np.random.permutation(values),
    'cond_2': np.random.permutation(values)
})
# Add a cumcount column 'cc' in each group of 'cond_1' and 'cond_2'
df_sample['cc'] = df_sample.groupby(['cond_1', 'cond_2']).cumcount()
df_sample:
cond_1 cond_2 cc
0 True False 0
1 True False 1
2 False False 0
.. ... ... ...
663 True True 50
664 False True 128
665 False True 129
[666 rows x 3 columns]
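Since each column is an independent permutation of the same 181/485 array, the template is guaranteed to have the required marginal counts, which you can confirm directly:
print(df_sample[['cond_1', 'cond_2']].sum())  # 181 True in each column
print(len(df_sample))                         # 666 rows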
Second, shuffle df, give it the same per-group cumulative count, and merge with df_sample to extract the IDs that match the template.
target_id = df_sample.merge(
    df.assign(cc=df.sample(frac=1).groupby(['cond_1', 'cond_2']).cumcount()),
    how='left')['ID']
out = df[df['ID'].isin(target_id)]
out
ID cond_1 cond_2
5 6 True True
8 9 False True
12 13 False False
... ... ... ...
2487 2488 True False
2489 2490 True False
2490 2491 False False
[666 rows x 3 columns]
If you don't want the results to be sorted by ID, use out.sample(frac=1) instead of out.
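For example, optionally resetting the index after the shuffle:
out_shuffled = out.sample(frac=1).reset_index(drop=True)  # shuffled rows, fresh index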
Third, verify that the sample was extracted correctly.
print('number of rows :', len(out))
print('Number of True in all columns : \n', out[['cond_1', 'cond_2']].sum())
output:
number of rows : 666
Number of True in all columns :
cond_1 181
cond_2 181
dtype: int64
In this case it meets the desired result, but if it doesn't, run the code again from the beginning: the random template can occasionally request more rows of a particular (cond_1, cond_2) combination than df contains, and the merge then fails to match every row.
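If you'd rather not rerun manually, the whole procedure can be wrapped in a retry loop. Below is a minimal sketch of that idea; the function name constrained_sample and the max_tries parameter are illustrative additions, not part of the steps above:
import numpy as np
import pandas as pd

def constrained_sample(df, n_true=181, n_false=485, max_tries=100):
    # Illustrative wrapper: repeat the template-and-merge procedure until
    # every template row finds a matching row in df
    values = np.array([True] * n_true + [False] * n_false)
    for _ in range(max_tries):
        # Fresh random template with exactly n_true/n_false in each column
        df_sample = pd.DataFrame({
            'cond_1': np.random.permutation(values),
            'cond_2': np.random.permutation(values)
        })
        df_sample['cc'] = df_sample.groupby(['cond_1', 'cond_2']).cumcount()
        # Shuffle df, cumcount within each group, and match the template
        target_id = df_sample.merge(
            df.assign(cc=df.sample(frac=1).groupby(['cond_1', 'cond_2']).cumcount()),
            how='left')['ID']
        # If the template asked for more rows of some (cond_1, cond_2)
        # combination than df contains, the left merge leaves NaNs; retry
        if target_id.notna().all():
            return df[df['ID'].isin(target_id)]
    raise RuntimeError('No valid sample found; check that df has enough rows per combination.')

out = constrained_sample(df)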