pythonpandasdataframe

Groupby a df column based on more than 3 columns


I have an df which has 3 columns: Region, Country and AREA_CODE.

Region      Country     AREA_CODE   AREA_SUB_CODE_1     AREA_SUB_CODE_2
===========================================================================
AMER       US             A1            A1_US_1           A1_US_2
AMER       CANADA         A1            A1_CA_1           A1_CA_2
AMER       US             B1            B1_US_1           B1_US_2
AMER       US             A1            A1_US_1           A1_US_2

Is there a way to get output list of both the AREA_SUB_CODE_1 and AREA_SUB_CODE_2 as a list under each of the previous column value. something like the below?

{
    "AREA_SUB_CODE_1": {
        "AMER": {
                "US": {
                    "A1": ["A1_US_1"],
                    "B1": ["B1_US_1"]
                },
                "CANADA": {
                    "A1": ["A1_CA_1"],
                }
            }
    },
    "AREA_SUB_CODE_2": {
        "AMER": {
                "US": {
                    "A1": {
                        "A1_US_1": ["A1_US_2"]
                    },
                    "B1": {
                        "B1_US_1": ["B1_US_2"]
                },
                "CANADA": {
                    "A1": {
                        "A1_CA_1": ["A1_CA_2"],
                        }
                }
            }
    },
}

So far i have tried to groupby on 3 columns it works which is,

for (k1, k2), v in df.groupby(['Region', 'Country'])['AREA_CODE']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())
 

But when i try to groupby 4 columns, it is throwing error

too many values to unpack (expected 2)

for (k1, k2), v in df.groupby(['Region', 'Country', 'AREA_CODE'])['AREA_SUB_CODE_1']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())

How to apply groupby for 4 columns and 5 columns? Or any other way to achieve this?


Solution

  • I think we can achieve this with the following recursive function:

    f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]} 
                   if s.index.nlevels > 1 
                   else {k: s.loc[[k]].unique().tolist() 
                         for k in s.index.unique()})   
    

    Here, s is expected to be a pandas.Series with hierarchical indexing. At each indexing level, we map the keys to the corresponding depth of the resulting dictionary. At the last level, we extract unique values into a list. The double square brackets in s.loc[[k]] ensure the output is a series, the following unique method returns a numpy.ndarray with unique values of the series, and tolist converts the array into a Python list.

    If we know there's exactly one unique value at the final level, we can simplify the function:

    f = lambda s: {k: f(s[k]) for k in s.index.levels[0]} \
                  if s.index.nlevels > 1 \
                  else s.to_dict()
    

    In this case, we skip creating a list at the end. But if needed, we can insert additional mapping like s.map(lambda x: [x]).to_dict().

    Before applying any of the function above, we have to transform the data into a properly indexed series:

    inner = ['Region', 'Country', 'AREA_CODE']
    values = df.melt(inner).set_index(['variable', *inner]).squeeze()
    

    Here, 'variable' is the default name for the new column with the rest of column names excluding the inner list after melting. The final answer is f(values)

    Let's see the example:

    df = pd.DataFrame({
        'Region': ['AMER', 'AMER', 'AMER', 'AMER'],
        'Country': ['US', 'CANADA', 'US', 'US'],
        'AREA_CODE': ['A1', 'A1', 'B1', 'A1'],
        'AREA_SUB_CODE_1': ['A1_US_1x', 'A1_CA_1', 'B1_US_1', 'A1_US_1y'],
        'AREA_SUB_CODE_2': ['A1_US_2', 'A1_CA_2', 'B1_US_2', 'A1_US_2']})
    
    f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]} 
                   if s.index.nlevels > 1 
                   else {k: s.loc[[k]].unique().tolist() 
                         for k in s.index.unique()}) 
      
    inner = ['Region', 'Country', 'AREA_CODE']
    values = df.melt(inner, var_name='sub_code').set_index(['sub_code', *inner]).squeeze()
    answer = f(values)
    

    Note, that in this example, we have 2 different values for the key set ('AREA_SUB_CODE_1', 'AMER', 'US', 'A1') and 2 equal ones for the key set ('AREA_SUB_CODE_2', 'AMER', 'US', 'A1'), so the second case will end up as a list with one value in the final answer:

    {'AREA_SUB_CODE_1': {'AMER': {'CANADA': {'A1': ['A1_CA_1']},
                                  'US': {'A1': ['A1_US_1x', 'A1_US_1y'],
                                         'B1': ['B1_US_1']}}},
     'AREA_SUB_CODE_2': {'AMER': {'CANADA': {'A1': ['A1_CA_2']},
                                  'US': {'A1': ['A1_US_2'], 'B1': ['B1_US_2']}}}}
    

    If we drop the last record in the example data, then we can use the alternative function with s.to_dict() at the end.