pythonpandasnumpycombinatorics

Generating combinations in pandas dataframe


I have a dataset with ["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'] columns. All values in ["Uni", 'Region', "Profession"] are filled while ["Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'] always contain NAs.

For each column with NAs there are several possible values

Level_Edu = ['undergrad', 'grad', 'PhD']
Financial_Base = ['personal', 'grant']
Learning_Time = ["morning", "day", "evening"]
GENDER = ['Male', 'Female']

I want to generate all possible combinations of ["Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'] for each observation in the initial data. So that each initial observation would be represented by 36 new observations (obtained by the formula of combinatorics: N1 * N2 * N3 * N4, where Ni is the length of the i-th vector of possible values for a column)

Here is a Python code for recreating two initial observations and approximation of the result I desire to get (showing 3 combinations out of 36 for each initial observation I want).

import pandas as pd
import numpy as np
sample_data_as_is = pd.DataFrame([["X1", "Y1", "Z1", np.nan, np.nan, np.nan, np.nan], ["X2", "Y2", "Z2", np.nan, np.nan, np.nan, np.nan]], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])

sample_data_to_be = pd.DataFrame([["X1", "Y1", "Z1", "undergrad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "personal", "morning", 'Male'],
                                  ["X1", "Y1", "Z1", "grad", "personal", "morning", 'Male'], ["X2", "Y2", "Z2", "grad", "personal", "morning", 'Male'],
                                  ["X1", "Y1", "Z1", "undergrad", "grant", "morning", 'Male'], ["X2", "Y2", "Z2", "undergrad", "grant", "morning", 'Male']], columns=["Uni", 'Region', "Profession", "Level_Edu", 'Financial_Base', 'Learning_Time', 'GENDER'])


Solution

  • You can combine itertools.product and a cross-merge:

    from itertools import product
    
    data = {'Level_Edu': ['undergrad', 'grad', 'PhD'],
            'Financial_Base': ['personal', 'grant'],
            'Learning_Time': ['morning', 'day', 'evening'],
            'GENDER': ['Male', 'Female']}
    
    out = (sample_data_as_is[['Uni', 'Region', 'Profession']]
           .merge(pd.DataFrame(product(*data.values()), columns=data.keys()), how='cross')
          )
    

    Output:

       Uni Region Profession  Level_Edu Financial_Base Learning_Time  GENDER
    0   X1     Y1         Z1  undergrad       personal       morning    Male
    1   X1     Y1         Z1  undergrad       personal       morning  Female
    2   X1     Y1         Z1  undergrad       personal           day    Male
    3   X1     Y1         Z1  undergrad       personal           day  Female
    4   X1     Y1         Z1  undergrad       personal       evening    Male
    ..  ..    ...        ...        ...            ...           ...     ...
    67  X2     Y2         Z2        PhD          grant       morning  Female
    68  X2     Y2         Z2        PhD          grant           day    Male
    69  X2     Y2         Z2        PhD          grant           day  Female
    70  X2     Y2         Z2        PhD          grant       evening    Male
    71  X2     Y2         Z2        PhD          grant       evening  Female
    
    [72 rows x 7 columns]
    

    If you want the specific order of rows/columns from your expected output:

    cols = ['Uni', 'Region', 'Profession']
    out = (pd.DataFrame(product(*data.values()), columns=data.keys())
             .merge(sample_data_as_is[cols], how='cross')
             [cols+list(data)]
          )
    

    Output:

       Uni Region Profession  Level_Edu Financial_Base Learning_Time  GENDER
    0   X1     Y1         Z1  undergrad       personal       morning    Male
    1   X2     Y2         Z2  undergrad       personal       morning    Male
    2   X1     Y1         Z1  undergrad       personal       morning  Female
    3   X2     Y2         Z2  undergrad       personal       morning  Female
    4   X1     Y1         Z1  undergrad       personal           day    Male
    ..  ..    ...        ...        ...            ...           ...     ...
    67  X2     Y2         Z2        PhD          grant           day  Female
    68  X1     Y1         Z1        PhD          grant       evening    Male
    69  X2     Y2         Z2        PhD          grant       evening    Male
    70  X1     Y1         Z1        PhD          grant       evening  Female
    71  X2     Y2         Z2        PhD          grant       evening  Female
    
    [72 rows x 7 columns]