pythonpandasnumpysampling

How to randomly sample a dataset with specific constraints on two boolean conditions by Python?


I'm working with a dataset of 2,500 samples, and I need to extract a random subset of 666 samples while meeting specific conditions based on two boolean columns.

The dataset (df) contains the following columns:

When sampling the subset, I need to ensure these conditions are satisfied:

  1. The number of True values in cond_1 should be exactly 181.
  2. The number of True values in cond_2 should be exactly 181.
  3. The number of False values in cond_1 should be exactly 485.
  4. The number of False values in cond_2 should be exactly 485.
  5. The total number of samples in the subset should be exactly 666.

Additionally:

Is there a recommended way to achieve this type of constrained sampling in Python?

Any code examples or libraries that could help would be greatly appreciated.

Thanks in advance.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': np.random.choice([True, False], size=2500),
    'cond_2': np.random.choice([True, False], size=2500)
})

# Create the groups based on conditions
group_cond1_true = df[df['cond_1'] == True]
group_cond1_false = df[df['cond_1'] == False]

group_cond2_true = df[df['cond_2'] == True]
group_cond2_false = df[df['cond_2'] == False]

# Sample from each group to meet constraints
sample_cond1_true = group_cond1_true.sample(n=181, random_state=42)
sample_cond1_false = group_cond1_false.sample(n=485, random_state=42)

sample_cond2_true = group_cond2_true.sample(n=181, random_state=42)
sample_cond2_false = group_cond2_false.sample(n=485, random_state=42)

# Combine the samples to create the final dataset
df_extracted = pd.concat([sample_cond1_true, sample_cond1_false, sample_cond2_true, sample_cond2_false]).drop_duplicates()

# Check if the conditions are met
print(f"Total length: {len(df_extracted)}")
print(f"cond_1 True count: {df_extracted['cond_1'].sum()}")
print(f"cond_2 True count: {df_extracted['cond_2'].sum()}")
print(f"cond_1 False count: {(~df_extracted['cond_1']).sum()}")
print(f"cond_2 False count: {(~df_extracted['cond_2']).sum()}")
print()
count = df_extracted.groupby(['cond_1', 'cond_2']).size()
print("Extracted counts:\n", count)

The code above generated df_extracted with the size of 1148, not 666.

Total length: 1148
cond_1 True count: 454
cond_2 True count: 460
cond_1 False count: 694
cond_2 False count: 688

Extracted counts:
 cond_1  cond_2
False   False     396
        True      298
True    False     292
        True      162
dtype: int64

Solution

  • Code

    First, create a data frame with a randomized array of a set number of True and False, and give it a cumcount.

    import numpy as np
    import pandas as pd
    
    # Create a True/False array
    n_true, n_false = 181, 485
    values = np.array([True] * n_true + [False] * n_false)
    
    # Randomly permute the array twice to create two columns in the DataFrame
    df_sample = pd.DataFrame({
        'cond_1': np.random.permutation(values),
        'cond_2': np.random.permutation(values)
    })
    
    # Add a cumcount column 'cc' in each group of 'cond_1' and 'cond_2'
    df_sample['cc'] = df_sample.groupby(['cond_1', 'cond_2']).cumcount()
    

    df_sample:

         cond_1  cond_2   cc
    0      True   False    0
    1      True   False    1
    2     False   False    0
    ..      ...     ...  ...
    663    True    True   50
    664   False    True  128
    665   False    True  129
    
    [666 rows x 3 columns]
    

    Second, randomize df and then give cumcount to extract the IDs that match df_sample.

    target_id = df_sample.merge(
        df.assign(cc=df.sample(frac=1).groupby(['cond_1', 'cond_2']).cumcount()), 
        how='left')['ID']
    out = df[df['ID'].isin(target_id)]
    

    out

            ID  cond_1  cond_2
    5        6    True    True
    8        9   False    True
    12      13   False   False
    ...    ...     ...     ...
    2487  2488    True   False
    2489  2490    True   False
    2490  2491   False   False
    
    [666 rows x 3 columns]
    

    If you don't want the results to be sorted by ID, use out.sample(frac=1) instead of out.


    Third, verify that the sample was extracted correctly.

    print('number of rows :', len(out))
    print('Number of True in all columns : \n', out[['cond_1', 'cond_2']].sum())
    

    output:

    number of rows : 666
    Number of True in all columns : 
     cond_1    181
    cond_2    181
    dtype: int64
    

    In this case, it should meet your desired result, but if it doesn't, run the code again from the beginning.