pythonpandasdictionarynestedmelt

How to flatten a nested json strings dataframe columns


Here is my example data with two fields where the last one [outbreak] is a pandas series.

Start:

Start

Goal (Excel mock-up):

Goal

Reproduction Code:

import pandas as pd
import json

d = {'report_id': [100, 101], 'outbreak': [
    '{"outbreak_100":{"name":"Chris","disease":"A-Pox"},"outbreak_101":{"name":"Stacy","disease": "H-Pox"}}', 
    '{"outbreak_200":{"name":"Brandon","disease":"C-Pox"},"outbreak_201":{"name":"Karen","disease": "G-Pox"},"outbreak_202":{"name":"Tim","disease": "Z-Pox"}}']}

df = pd.DataFrame(data=d)
print(type(df['outbreak']))
display(df)

#Ignore
df = pd.json_normalize(df['outbreak'].apply(json.loads), max_level=0)
display(df)

Attempts: I thought about using json_normalize() which would convert every [outbreak_id] to its own field and then use pandas.wide_to_long() to get my final output. It works in testing but my concern is that my actual production data is so long and nested that it ends up generating hundred of thousands of fields before pivoting. That does not sounds good to me and why I also hope to avoid loop iterations.

I also thought about using df = df.explode('outbreak') but I am getting a KeyError: 0.

Perhaps someone has a better idea than I do?


Solution

  • First we need to convert the json strings into dictionaries. Then we can convert the column into a dictionary, back into a dataframe and stack it. That makes the original wide dataframe into a long one as desired.

    import json
    d = {'report_id': [100, 101], 'outbreak': [
        '{"outbreak_100":{"name":"Chris","disease":"A-Pox"},"outbreak_101":{"name":"Stacy","disease": "H-Pox"}}', 
        '{"outbreak_200":{"name":"Brandon","disease":"C-Pox"},"outbreak_201":{"name":"Karen","disease": "G-Pox"},"outbreak_202":{"name":"Tim","disease": "Z-Pox"}}']}
    
    df = pd.DataFrame(d)
    # use json.loads to parse the json and construct df from it
    df = (
        pd.DataFrame(
            df.set_index('report_id')['outbreak'].map(json.loads).to_dict()
        )
        .stack()
        .rename_axis(['outbreak_id', 'report_id'], axis=0)
        .reset_index(name='outbreak_value')
    )
    
        outbreak_id  report_id                           outbreak_value
    0  outbreak_100        100    {'name': 'Chris', 'disease': 'A-Pox'}
    1  outbreak_101        100    {'name': 'Stacy', 'disease': 'H-Pox'}
    2  outbreak_200        101  {'name': 'Brandon', 'disease': 'C-Pox'}
    3  outbreak_201        101    {'name': 'Karen', 'disease': 'G-Pox'}
    4  outbreak_202        101      {'name': 'Tim', 'disease': 'Z-Pox'}