Tags: python, python-3.x, dataframe, random, pandas-groupby

How to sample a different number of rows from each group in a DataFrame


I have a DataFrame with a category column. The DataFrame has a different number of rows for each category:

category number_of_rows
cat1     19189
cat2     13193
cat3     4500
cat4     1914
cat5     568
cat6     473
cat7     216
cat8     206
cat9     197
cat10    147
cat11    130
cat12    49
cat13    38
cat14    35
cat15    35
cat16    30
cat17    29
cat18    9
cat19    4
cat20    4
cat21    1
cat22    1
cat23    1

I want to select a different number of rows from each category (instead of a fixed n rows from every category).

Example inputs:

size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}

What I want to do is essentially stratified sampling, with a given number of instances for each category.

Also, the rows should be selected randomly. For example, I don't want the top 40 rows for size_1["cat1"]; I want 40 random rows.

Thanks for the help.


Solution

  • TL;DR

    output = pd.concat(
        [
            df.loc[df["{category_column_name}"] == {category_value_1}].sample({n_samples_1}, random_state=42),
            df.loc[df["{category_column_name}"] == {category_value_2}].sample({n_samples_2}, random_state=42),
        ]
    )
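
    If the per-category sample sizes come as a dict (like size_1/size_2 in the question), the same pattern can be driven by a comprehension. A minimal sketch, assuming the column is literally named "category" and that sizes is one of the dicts above (both hypothetical names):

        import pandas as pd

        # sizes maps each category value to the number of rows to draw from it,
        # e.g. sizes = {"cat1": 40, "cat2": 20, ...}
        output = pd.concat(
            [
                df.loc[df["category"] == cat].sample(n, random_state=42)
                for cat, n in sizes.items()
            ]
        )

    Note that DataFrame.sample raises a ValueError if n exceeds the group size, unless you pass replace=True.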
    

    Details and alternatives

    Artificial data generation


    DataFrame

    Let's first generate some data to see how we can solve the problem:

    import pandas as pd

    # Define a DataFrame containing employee data
    df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                       'Age': [27, 24, 22, 32, 15],
                       'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                       'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})
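
    Printed with print(df), this gives:

           Category  Age    Address  Qualification
        0       Jai   27      Delhi            Msc
        1       Jai   24     Kanpur             MA
        2       Jai   22  Allahabad            MCA
        3    Princi   32    Kannauj            Phd
        4    Princi   15      Noida           10th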
    

    Sampling rule

    # Number of rows we want to sample from each category
    samples_per_group_dict = {'Jai': 1,
                              'Princi': 2}
    


    Problem solving


    I can propose three solutions:

    1. Sample each group individually and concatenate the results (which I prefer)

       output = pd.concat(
           [
               df.loc[df["Category"] == "Jai"].sample(1, random_state=42),
               df.loc[df["Category"] == "Princi"].sample(2, random_state=42),
           ]
       )
    
    2. Apply on groupby (one-liner; see the reproducibility note after this list)

       output = df.groupby('Category').apply(lambda group: group.sample(samples_per_group_dict[group.name])).reset_index(drop=True)
      
    3. Loop over the groups (more verbose, using groupby)

      list_of_sampled_groups = []
      
      for name, group in df.groupby('Category'):    
          n_rows_to_sample = samples_per_group_dict[name]
          sampled_group = group.sample(n_rows_to_sample)
          list_of_sampled_groups.append(sampled_group)
      
      output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
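
    One caveat with options 2 and 3 as written: they don't pass random_state, so the sample changes on every run. A minimal sketch making option 2 reproducible, reusing the names defined above:

        output = (
            df.groupby('Category')
              .apply(lambda group: group.sample(samples_per_group_dict[group.name], random_state=42))
              .reset_index(drop=True)
        )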
      

    Performance should be roughly the same across these approaches.
    If performance matters, you can vectorize the per-group calculations, but the exact optimization depends on the number of groups and the number of samples per group.
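
    As one example of vectorizing, you can shuffle the whole frame once and then keep the first n rows of each group in shuffled order, which amounts to a random sample without replacement per group. A sketch, reusing samples_per_group_dict from above:

        # Shuffle all rows once, then keep the first n rows of each category
        # in shuffled order -- a random sample without replacement per group.
        shuffled = df.sample(frac=1, random_state=42)
        keep = shuffled.groupby('Category').cumcount() < shuffled['Category'].map(samples_per_group_dict)
        output = shuffled[keep].reset_index(drop=True)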