Tags: python, json, pandas

Explode nested JSON to DataFrame


There are loads of answers on this topic, but for the life of me, I cannot find a solution to my issue.

Say I have a JSON like

json_2_explode = [
    {
        "scalar": "43",
        "units": "m",
        "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}],
        "name": "Foo",
    },
    {
        "scalar": "54.1",
        "units": "s",
        "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}],
        "name": "Yoo",
    },
    {
        "scalar": "1123.1",
        "units": "Hz",
        "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}],
        "name": "Baz",
    },
]

documenting some readings for the attributes Foo, Yoo and Baz. Each entry records a scalar value, its units, some parameters, and the name.

Say this JSON is a column in a DataFrame,

df = pd.DataFrame(data={'col1': [11, 9, 23, 1],
                        'col2': [7, 3, 1, 12],
                        'col_json': [json_2_explode,
                                     json_2_explode,
                                     json_2_explode,
                                     json_2_explode]}, index=[0, 1, 2, 3])
    col1    col2    col_json
0   11      7        [{'scalar': '43', 'units': 'm', 'parameter':...
1   9       3        [{'scalar': '43', 'units': 'm', 'parameter':...
2   23      1        [{'scalar': '43', 'units': 'm', 'parameter':...
3   1       12       [{'scalar': '43', 'units': 'm', 'parameter':...

The issue I have is that if I try

df = pd.json_normalize(df['col_json'].explode())

I get

    scalar  units            parameter                          name
0   43      m   [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
1   54.1    s   [{'no_1': '78', 'no_2': '103', 'no_3': '356'}]  Yoo
2   1123.1  Hz  [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}]  Baz
3   43      m   [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
4   54.1    s   [{'no_1': '78', 'no_2': '103', 'no_3': '356'}]  Yoo
5   1123.1  Hz  [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}]  Baz
6   43      m   [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
7   54.1    s   [{'no_1': '78', 'no_2': '103', 'no_3': '356'}]  Yoo
8   1123.1  Hz  [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}]  Baz
9   43      m   [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
10  54.1    s   [{'no_1': '78', 'no_2': '103', 'no_3': '356'}]  Yoo
11  1123.1  Hz  [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}]  Baz

So it is exploding each JSON into 3 rows (admittedly, each JSON does contain 3 sub-dicts, so to speak). What I actually want is for Foo, Yoo and Baz to be documented in the same row, adding columns instead. Is there maybe a solution not involving manually manipulating rows/piecing it together as desired? I would love to see one of your fancy one-liners.
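The fan-out can be reproduced minimally (a sketch reusing `json_2_explode` and `df` from above): `explode()` emits one row per list element, so 4 rows of 3 dicts each become 12 rows before `json_normalize` ever runs.

```python
import pandas as pd

# Rebuild the question's data: 4 rows, each carrying the same 3-dict JSON list
json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
    {"scalar": "1123.1", "units": "Hz",
     "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}], "name": "Baz"},
]
df = pd.DataFrame({"col1": [11, 9, 23, 1], "col2": [7, 3, 1, 12],
                   "col_json": [json_2_explode] * 4})

exploded = df["col_json"].explode()       # one row per dict in each list
flat = pd.json_normalize(list(exploded))  # normalize the 12 dicts
print(len(exploded), flat.shape)          # 12 (12, 4)
```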

EDIT

The desired outcome would look like this (column names free to assign).

For each row, the JSON is exploded and assigned to new columns; json_normalize creates a new row for each element in the list of JSONs, which is the undesired behaviour in the example above.

0   11  7   43  m   45  1038    356 Foo 54.1    s   78  103 356 Yoo 1123.1  Hz  21  43  3577    Baz
1   9   3   43  m   45  1038    356 Foo 54.1    s   78  103 356 Yoo 1123.1  Hz  21  43  3577    Baz
2   23  1   43  m   45  1038    356 Foo 54.1    s   78  103 356 Yoo 1123.1  Hz  21  43  3577    Baz
3   1   12  43  m   45  1038    356 Foo 54.1    s   78  103 356 Yoo 1123.1  Hz  21  43  3577    Baz

Note that in this example each JSON list in the original DataFrame is the same, but the idea is to explode each JSON into its neighbouring new columns; in general they could differ.

EDIT 2 Attempt to implement a solution suggested by Jonathan Leon

First I use his code to define a function, to be mapped over the column containing the JSON objects:

from flatten_json import flatten  # the flatten helper comes from the flatten_json package

def json_flattener(json_blurp):
    prefix_json_2_explode = {}
    for d in json_blurp:
        prefix_json_2_explode.update({d['name'] + '_' + key: value for key, value in d.items()})
    dict_flattened = (flatten(d, '.') for d in [prefix_json_2_explode])
    df = pd.DataFrame(dict_flattened)
    return df


new_cols = ['Foo_scalar', 'Foo_units', 'Foo_parameter.0.no_1',
       'Foo_parameter.0.no_2', 'Foo_parameter.0.no_3', 'Foo_name',
       'Yoo_scalar', 'Yoo_units', 'Yoo_parameter.0.no_1',
       'Yoo_parameter.0.no_2', 'Yoo_parameter.0.no_3', 'Yoo_name',
       'Baz_scalar', 'Baz_units', 'Baz_parameter.0.no_1',
       'Baz_parameter.0.no_2', 'Baz_parameter.0.no_3', 'Baz_name']


df[new_cols] = df[['col_json']].apply(lambda x: json_flattener(x.iloc[0]), axis=1, result_type='expand')

But I ran into the error:

ValueError: If using all scalar values, you must pass an index

Trying to sort it out.
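One workaround I am considering (a sketch of my own, not necessarily what Jonathan Leon intended): build the flat dict in plain Python and return a pd.Series rather than a DataFrame, so apply can expand it straight into columns. This also drops the external flatten dependency; the key format (e.g. `Foo_parameter.0.no_1`) matches new_cols above.

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
    {"scalar": "1123.1", "units": "Hz",
     "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}], "name": "Baz"},
]
df = pd.DataFrame({"col1": [11, 9, 23, 1], "col2": [7, 3, 1, 12],
                   "col_json": [json_2_explode] * 4})

def json_flattener(json_blurp):
    """Flatten one JSON list into a single flat mapping, prefixing every key
    with the record's 'name' and indexing into nested lists."""
    flat = {}
    for d in json_blurp:
        prefix = d["name"] + "_"
        for key, value in d.items():
            if isinstance(value, list):          # e.g. the 'parameter' list
                for i, sub in enumerate(value):
                    for k, v in sub.items():
                        flat[f"{prefix}{key}.{i}.{k}"] = v
            else:
                flat[prefix + key] = value
    return pd.Series(flat)                       # Series -> apply expands to columns

wide = df["col_json"].apply(json_flattener)
df[wide.columns] = wide                          # 18 new columns, one per flat key
```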


Solution

  • I believe what you're looking for is to apply pd.json_normalize individually to each JSON object and transform the result into a pd.Series with a two-level index. The first level should be 'name', and the second level should consist of all other parameters from each atomic JSON item. A one-liner for this could be:

    df['col_json'].apply(lambda x: pd.json_normalize(x).set_index('name').stack())
    

    The output from the above line looks like this:

    output 1
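To make that screenshot concrete, here is a sketch of what the one-liner produces on the question's data: each cell becomes a Series with a ('name', attribute) MultiIndex, so apply yields a frame with two column levels.

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
    {"scalar": "1123.1", "units": "Hz",
     "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}], "name": "Baz"},
]
df = pd.DataFrame({"col1": [11, 9, 23, 1], "col2": [7, 3, 1, 12],
                   "col_json": [json_2_explode] * 4})

out = df["col_json"].apply(lambda x: pd.json_normalize(x).set_index("name").stack())
print(out.shape)            # (4, 9): 3 names x 3 remaining attributes per row
print(out.columns.nlevels)  # 2, e.g. ('Foo', 'scalar'), ('Foo', 'units'), ...
```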

    Or, to dig deeper, you have to specify the record_path and meta parameters of json_normalize because of the inner list that wraps the parameter dictionary:

    df["col_json"].apply(
        lambda x: pd.json_normalize(
            data=x, record_path="parameter", meta=["scalar", "units", "name"]
        )
        .set_index("name")
        .stack()
    )
    

    output 2
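Concretely, on the example data this variant also unpacks the parameter dicts, so each row carries 3 names times 5 attributes (a sketch assuming the question's df):

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
    {"scalar": "1123.1", "units": "Hz",
     "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}], "name": "Baz"},
]
df = pd.DataFrame({"col1": [11, 9, 23, 1], "col2": [7, 3, 1, 12],
                   "col_json": [json_2_explode] * 4})

out = df["col_json"].apply(
    lambda x: pd.json_normalize(x, record_path="parameter",
                                meta=["scalar", "units", "name"])
    .set_index("name")
    .stack()
)
print(out.shape)  # (4, 15): no_1..no_3 plus the scalar and units meta, per name
```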

    If preserving the MultiIndex isn't important, you can simply concatenate the result with the main DataFrame, like so:

    output = pd.concat([
        df.drop(columns='col_json'),
        df['col_json'].apply(lambda x: pd.json_normalize(x).set_index('name').stack())
    ], axis='columns')
    

    However, if you do want to keep the MultiIndex, you'll need to add a new column level to the original DataFrame before concatenation. For example:

    json_normalize = lambda x: (
        pd.json_normalize(x, 'parameter', ['scalar', 'units', 'name'])
        .set_index('name')
        .stack()
    )
    
    # make a new level at the very top of column levels and fill it with 'name'
    add_level = lambda df, name: pd.concat([df], keys=[name], axis=1)
    
    output = pd.concat(
        [
            df.drop(columns='col_json').pipe(add_level, 'Main'),
            df['col_json'].apply(json_normalize),
        ], 
        axis='columns'
    )
    

    final output
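As a quick sanity check on that last variant (a condensed rerun on the question's data, with the lambda renamed to avoid confusion with pd.json_normalize): the added 'Main' level brings both halves to two column levels, so the concat aligns cleanly.

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
    {"scalar": "1123.1", "units": "Hz",
     "parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}], "name": "Baz"},
]
df = pd.DataFrame({"col1": [11, 9, 23, 1], "col2": [7, 3, 1, 12],
                   "col_json": [json_2_explode] * 4})

normalize = lambda x: (
    pd.json_normalize(x, 'parameter', ['scalar', 'units', 'name'])
    .set_index('name')
    .stack()
)
# add a top column level so both frames have two levels before concatenating
add_level = lambda frame, name: pd.concat([frame], keys=[name], axis=1)

output = pd.concat(
    [df.drop(columns='col_json').pipe(add_level, 'Main'),
     df['col_json'].apply(normalize)],
    axis='columns',
)
print(output.columns.nlevels)  # 2
print(output.shape)            # (4, 17): col1, col2 plus 15 JSON columns
```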