pythonsqlreshape

Long to wide in Python or SQL with all possible data combinations in columns


Input format:

CUSTOMER    MONTH   ISSUE
1   M1  ABC
1   M1  DEF
1   M2  ABC
1   M3  QRS
2   M1  PQR
2   M2  PQR
2   M2  ABC
2   M3  DEF

Desired output format:

CUSTOMER    M1  M2  M3
1   ABC ABC QRS
1   DEF ABC QRS
2   PQR PQR DEF
2   PQR ABC DEF

I want to find out the possible distinct combinations at customer level and eventually sum of to get a count of such patterns across customers.

I can achieve the above using SQL joins on the same table, but the data is very huge & the number of combinations is very high. So it isn't efficient.

Pivot in Python & SQL wouldn't consider these duplicates. Is there any other solution I could try?

Thank you

I can achieve the above using SQL joins on the same table, but the data is very huge & the number of combinations is very high. So it isn't efficient.

Pivot in Python & SQL wouldn't consider these duplicates. Pivot error in Python: ValueError: Index contains duplicate entries, cannot reshape


Solution

  • In python you can achieve it using underneath segment a sample. You need to pass your data to the dataframe and will need to tweek the pivot rows and groupby column

    import pandas as pd
    
    data = {
        'CUSTOMER': [1, 2, 2],
        'MONTH': ['M1', 'M1', 'M2'],
        'ISSUE': ['ABC', 'DEF', 'ABC']
    }
    
    df = pd.DataFrame(data)
    
    df['issue_count'] = df.groupby(['CUSTOMER', 'MONTH']).cumcount()
    
    pivot_df = df.pivot_table(index=['CUSTOMER', 'issue_count'], columns='MONTH', values='ISSUE', aggfunc='first').reset_index()
    pivot_df = pivot_df.drop('issue_count', axis=1).fillna('')#droping the enumeration
    
    
    pivot_df = pivot_df[['CUSTOMER', 'M1', 'M2']]
    
    print(pivot_df)
    print("\n\n")
    
    pattern_counts = pivot_df.groupby(['M1', 'M2']).size().reset_index(name='Count')
    
    print(pattern_counts)
    

    Output:

    MONTH  CUSTOMER   M1   M2
    0             1  ABC
    1             2  DEF  ABC
    
    
    
        M1   M2  Count
    0  ABC           1
    1  DEF  ABC      1
    

    To get complete list of combination

    import pandas as pd
    from itertools import product
    data = {
        'CUSTOMER': [1, 2, 2],
        'MONTH': ['M1', 'M1', 'M2'],
        'ISSUE': ['ABC', 'DEF', 'ABC']
    }
    
    df = pd.DataFrame(data)
    
    unique_issues = {month: df[df['MONTH'] == month]['ISSUE'].unique() for month in df['MONTH'].unique()}
    all_combinations = []
    for customer in df['CUSTOMER'].unique():
        for combination in product(*[unique_issues[month] for month in unique_issues]):
            combo_dict = {'CUSTOMER': customer}
            combo_dict.update({f'M{i+1}': issue for i, issue in enumerate(combination)})
            all_combinations.append(combo_dict)
    all_combinations_df = pd.DataFrame(all_combinations)
    pattern_counts = all_combinations_df.groupby(list(unique_issues.keys())).size().reset_index(name='Count')
    
    print(all_combinations_df)
    print("\n\n")
    print(pattern_counts)
    

    Output for above:

       CUSTOMER   M1   M2
    0         1  ABC  ABC
    1         1  DEF  ABC
    2         2  ABC  ABC
    3         2  DEF  ABC
    
    
    
        M1   M2  Count
    0  ABC  ABC      2
    1  DEF  ABC      2