I have a df with the following columns: Region, Country, AREA_CODE, AREA_SUB_CODE_1 and AREA_SUB_CODE_2.
Region  Country  AREA_CODE  AREA_SUB_CODE_1  AREA_SUB_CODE_2
============================================================
AMER    US       A1         A1_US_1          A1_US_2
AMER    CANADA   A1         A1_CA_1          A1_CA_2
AMER    US       B1         B1_US_1          B1_US_2
AMER    US       A1         A1_US_1          A1_US_2
Is there a way to get both AREA_SUB_CODE_1 and AREA_SUB_CODE_2 as lists nested under each of the preceding column values, something like the below?
{
    "AREA_SUB_CODE_1": {
        "AMER": {
            "US": {
                "A1": ["A1_US_1"],
                "B1": ["B1_US_1"]
            },
            "CANADA": {
                "A1": ["A1_CA_1"]
            }
        }
    },
    "AREA_SUB_CODE_2": {
        "AMER": {
            "US": {
                "A1": {
                    "A1_US_1": ["A1_US_2"]
                },
                "B1": {
                    "B1_US_1": ["B1_US_2"]
                }
            },
            "CANADA": {
                "A1": {
                    "A1_CA_1": ["A1_CA_2"]
                }
            }
        }
    }
}
So far I have tried a groupby involving three columns, which works:

tTmp = {}
for (k1, k2), v in df.groupby(['Region', 'Country'])['AREA_CODE']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())
But when I try to involve a fourth column, it throws an error: too many values to unpack (expected 2)

for (k1, k2), v in df.groupby(['Region', 'Country', 'AREA_CODE'])['AREA_SUB_CODE_1']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())
How do I apply groupby over 4 or 5 columns? Or is there another way to achieve this?
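First, the unpacking error itself: grouping by three columns yields 3-tuples as keys, so the loop has to unpack three names and nest one level deeper. A minimal sketch of that direct route (continuing with tTmp as an empty dict, as in the question):

tTmp = {}
for (k1, k2, k3), v in df.groupby(['Region', 'Country', 'AREA_CODE'])['AREA_SUB_CODE_1']:
    tTmp.setdefault(k1, {}).setdefault(k2, {})[k3] = sorted(v.unique())

This works, but every extra column needs another unpacked name and another setdefault, which does not scale well.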
For arbitrary depth, I think we can achieve this with the following recursive function:
f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else {k: s.loc[[k]].unique().tolist()
                     for k in s.index.unique()})
Here, s is expected to be a pandas.Series with a hierarchical index (a MultiIndex). At each index level, we map the keys to the corresponding depth of the resulting dictionary; at the last level, we extract the unique values into a list. The double square brackets in s.loc[[k]] ensure the output is a Series, the following unique method returns a numpy.ndarray with the unique values of that series, and tolist converts the array into a Python list. One caveat: after slicing, s.index.levels[0] can still contain values that no longer occur in the remaining data, so if different branches of your data have different key sets, iterating over s.index.remove_unused_levels().levels[0] is safer.
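To make the double-bracket point concrete, here is a quick illustration on a toy series (toy is a made-up name):

import pandas as pd

toy = pd.Series(['x', 'y', 'x'], index=['a', 'a', 'b'])

toy.loc['b']                       # the scalar 'x': a unique label collapses to a scalar
toy.loc[['b']]                     # a one-element Series: double brackets always keep a Series
toy.loc[['b']].unique().tolist()   # ['x'], works regardless of label duplication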
If we know there's exactly one unique value at the final level, we can simplify the function:
f = lambda s: {k: f(s[k]) for k in s.index.levels[0]} \
    if s.index.nlevels > 1 \
    else s.to_dict()
In this case, we skip creating a list at the end. But if needed, we can insert an additional mapping, like s.map(lambda x: [x]).to_dict().
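The difference between the two endings, again on a toy one-level series (tail is a made-up name):

import pandas as pd

tail = pd.Series(['A1_US_2'], index=['A1_US_1'])

tail.to_dict()                     # {'A1_US_1': 'A1_US_2'}
tail.map(lambda x: [x]).to_dict()  # {'A1_US_1': ['A1_US_2']}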
Before applying either of the functions above, we have to transform the data into a properly indexed series:
inner = ['Region', 'Country', 'AREA_CODE']
values = df.melt(inner).set_index(['variable', *inner]).squeeze()
Here, 'variable' is the default name of the new column that holds the remaining column names (those not in the inner list) after melting. The final answer is f(values).
Let's see a complete example:
import pandas as pd

df = pd.DataFrame({
    'Region': ['AMER', 'AMER', 'AMER', 'AMER'],
    'Country': ['US', 'CANADA', 'US', 'US'],
    'AREA_CODE': ['A1', 'A1', 'B1', 'A1'],
    'AREA_SUB_CODE_1': ['A1_US_1x', 'A1_CA_1', 'B1_US_1', 'A1_US_1y'],
    'AREA_SUB_CODE_2': ['A1_US_2', 'A1_CA_2', 'B1_US_2', 'A1_US_2']})

f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else {k: s.loc[[k]].unique().tolist()
                     for k in s.index.unique()})

inner = ['Region', 'Country', 'AREA_CODE']
values = df.melt(inner, var_name='sub_code').set_index(['sub_code', *inner]).squeeze()
answer = f(values)
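Before the recursion runs, it can help to inspect the intermediate series; for this data it holds 8 values (4 original rows times 2 melted sub-code columns) under a 4-level MultiIndex:

print(values.index.names)     # ['sub_code', 'Region', 'Country', 'AREA_CODE']
print(values.index.nlevels)   # 4
print(len(values))            # 8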
Note that in this example we have 2 different values for the key set ('AREA_SUB_CODE_1', 'AMER', 'US', 'A1') and 2 equal ones for the key set ('AREA_SUB_CODE_2', 'AMER', 'US', 'A1'), so the second case ends up as a single-value list in the final answer:
{'AREA_SUB_CODE_1': {'AMER': {'CANADA': {'A1': ['A1_CA_1']},
'US': {'A1': ['A1_US_1x', 'A1_US_1y'],
'B1': ['B1_US_1']}}},
'AREA_SUB_CODE_2': {'AMER': {'CANADA': {'A1': ['A1_CA_2']},
'US': {'A1': ['A1_US_2'], 'B1': ['B1_US_2']}}}}
If we drop the last record from the example data, then every key set maps to exactly one value, and we can use the alternative function with s.to_dict() at the end.
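A sketch of that variant, reusing df and inner from the example above (df2, g, values2 are illustrative names):

df2 = df.iloc[:-1]  # drop the last record so every key set has exactly one value

g = lambda s: {k: g(s[k]) for k in s.index.levels[0]} \
    if s.index.nlevels > 1 \
    else s.to_dict()

values2 = df2.melt(inner, var_name='sub_code').set_index(['sub_code', *inner]).squeeze()
answer2 = g(values2)
# answer2['AREA_SUB_CODE_1']['AMER']['US'] == {'A1': 'A1_US_1x', 'B1': 'B1_US_1'}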