There are loads of answers on this topic, but for the life of me, I cannot find a solution to my issue.
Say I have a JSON like
json_2_explode = [
{
"scalar": "43",
"units": "m",
"parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}],
"name": "Foo",
},
{
"scalar": "54.1",
"units": "s",
"parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}],
"name": "Yoo",
},
{
"scalar": "1123.1",
"units": "Hz",
"parameter": [{"no_1": "21", "no_2": "43", "no_3": "3577"}],
"name": "Baz",
},
]
documenting some readings for the attributes Foo, Yoo and Baz. For each I detail a number (the value itself), some parameters, and the name.
Say this JSON is a column in a dataframe,
df = pd.DataFrame(data = {'col1': [11, 9, 23, 1],
'col2': [7, 3, 1, 12],
'col_json' : [json_2_explode,
json_2_explode,
json_2_explode,
json_2_explode]}, index=[0, 1, 2, 3])
   col1  col2                                           col_json
0    11     7  [{'scalar': '43', 'units': 'm', 'parameter': ...
1     9     3  [{'scalar': '43', 'units': 'm', 'parameter': ...
2    23     1  [{'scalar': '43', 'units': 'm', 'parameter': ...
3     1    12  [{'scalar': '43', 'units': 'm', 'parameter': ...
The issue I have is that if I try
df = pd.json_normalize(df['col_json'].explode())
I get
scalar units parameter name
0 43 m [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
1 54.1 s [{'no_1': '78', 'no_2': '103', 'no_3': '356'}] Yoo
2 1123.1 Hz [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}] Baz
3 43 m [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
4 54.1 s [{'no_1': '78', 'no_2': '103', 'no_3': '356'}] Yoo
5 1123.1 Hz [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}] Baz
6 43 m [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
7 54.1 s [{'no_1': '78', 'no_2': '103', 'no_3': '356'}] Yoo
8 1123.1 Hz [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}] Baz
9 43 m [{'no_1': '45', 'no_2': '1038', 'no_3': '356'}] Foo
10 54.1 s [{'no_1': '78', 'no_2': '103', 'no_3': '356'}] Yoo
11 1123.1 Hz [{'no_1': '21', 'no_2': '43', 'no_3': '3577'}] Baz
So it is exploding each JSON into 3 rows (admittedly each JSON does contain 3 sub-dicts, so to say). I actually would like Foo, Yoo and Baz to be documented in the same row, adding columns.
Is there maybe a solution not involving manually manipulating rows/piecing it together as desired? I would love to see one of your fancy one-liners.
EDIT
The desired outcome would look like this (column names free to assign). For each row the JSON is exploded and assigned to new columns; json_normalize creates a new row for each element in the list of JSONs instead, which is the undesired behaviour in the example above:
0 11 7 43 m 45 1038 356 Foo 54.1 s 78 103 356 Yoo 1123.1 Hz 21 43 3577 Baz
1 9 3 43 m 45 1038 356 Foo 54.1 s 78 103 356 Yoo 1123.1 Hz 21 43 3577 Baz
2 23 1 43 m 45 1038 356 Foo 54.1 s 78 103 356 Yoo 1123.1 Hz 21 43 3577 Baz
3 1 12 43 m 45 1038 356 Foo 54.1 s 78 103 356 Yoo 1123.1 Hz 21 43 3577 Baz
Note in this example each JSON list in the original dataframe is the same, but the idea is of course to explode each JSON into its neighbouring new columns, and they could differ from row to row.
EDIT 2
Attempt to implement a solution suggested by Jonathan Leon. First I use his code to define a function, to be mapped to the column containing the JSON object:
from flatten_json import flatten  # assuming flatten() here is flatten_json's flatten

def json_flattener(json_blurp):
    # prefix every key with the item's name, then flatten the nested dicts
    prefix_json_2_explode = {}
    for d in json_blurp:
        prefix_json_2_explode.update({d['name'] + '_' + key: value for key, value in d.items()})
    dict_flattened = (flatten(d, '.') for d in [prefix_json_2_explode])
    df = pd.DataFrame(dict_flattened)
    return df
new_cols = ['Foo_scalar', 'Foo_units', 'Foo_parameter.0.no_1',
'Foo_parameter.0.no_2', 'Foo_parameter.0.no_3', 'Foo_name',
'Yoo_scalar', 'Yoo_units', 'Yoo_parameter.0.no_1',
'Yoo_parameter.0.no_2', 'Yoo_parameter.0.no_3', 'Yoo_name',
'Baz_scalar', 'Baz_units', 'Baz_parameter.0.no_1',
'Baz_parameter.0.no_2', 'Baz_parameter.0.no_3', 'Baz_name']
df[new_cols]= df[['col_json']].apply(lambda x: json_flattener(x.iloc[0]), axis=1,result_type='expand')
But I ran into the error:
ValueError: If using all scalar values, you must pass an index
Trying to sort it out.
I believe what you're looking for is to apply pd.json_normalize individually to each JSON object and transform the result into a pd.Series with a two-level index. The first level should be 'name', and the second level should consist of all other parameters from each atomic JSON item. A one-liner for this could be:
df['col_json'].apply(lambda x: pd.json_normalize(x).set_index('name').stack())
The above line produces one output row per original row, with a two-level (name, field) column index; note that the parameter lists are still unexpanded at this stage.
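As a self-contained sketch of what this produces (using a shortened two-item version of the question's data):

```python
import pandas as pd

# Shortened two-item version of the question's data
json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
]
df = pd.DataFrame({"col1": [11, 9], "col2": [7, 3],
                   "col_json": [json_2_explode, json_2_explode]})

# One output row per original row; the columns become a (name, field) MultiIndex
out = df["col_json"].apply(lambda x: pd.json_normalize(x).set_index("name").stack())
```

Here out.loc[0, ('Foo', 'scalar')] is '43', while out.loc[0, ('Foo', 'parameter')] still holds the unexpanded list of dicts.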
Or, to dig deeper, you have to specify the record_path and meta parameters of json_normalize because of the inner list that wraps the parameter dictionary:
df["col_json"].apply(
lambda x: pd.json_normalize(
data=x, record_path="parameter", meta=["scalar", "units", "name"]
)
.set_index("name")
.stack()
)
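Again as a runnable sketch with a shortened two-item version of the question's data; record_path unwraps the inner parameter list while meta carries the scalar fields along:

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
]
df = pd.DataFrame({"col_json": [json_2_explode, json_2_explode]})

out = df["col_json"].apply(
    lambda x: pd.json_normalize(
        data=x, record_path="parameter", meta=["scalar", "units", "name"]
    )
    .set_index("name")
    .stack()
)

# Now the no_* parameters are real columns, addressable as e.g. ("Foo", "no_1")
```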
If preserving the MultiIndex isn't important, you can simply concatenate the result with the main DataFrame, kind of:
output = pd.concat([
df.drop(columns='col_json'),
df['col_json'].apply(lambda x: pd.json_normalize(x).set_index('name').stack())
], axis='columns')
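One caveat: since df has flat column labels while the apply result has two-level ones, pandas merges them into a single flat column index, so labels like ('Foo', 'scalar') survive only as plain tuples. A sketch with shortened two-item data:

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
]
df = pd.DataFrame({"col1": [11, 9], "col2": [7, 3],
                   "col_json": [json_2_explode, json_2_explode]})

output = pd.concat([
    df.drop(columns="col_json"),
    df["col_json"].apply(lambda x: pd.json_normalize(x).set_index("name").stack()),
], axis="columns")

# col1/col2 plus the (name, field) pairs, all in one flat column index
```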
However, if you do want to keep the MultiIndex, you'll need to add a new column level to the original DataFrame before concatenation. For example:
json_normalize = lambda x: (
pd.json_normalize(x, 'parameter', ['scalar', 'units', 'name'])
.set_index('name')
.stack()
)
# make a new level at the very top of column levels and fill it with 'name'
add_level = lambda df, name: pd.concat([df], keys=[name], axis=1)
output = pd.concat(
[
df.drop(columns='col_json').pipe(add_level, 'Main'),
df['col_json'].apply(json_normalize),
],
axis='columns'
)
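Putting it all together as a self-contained sketch (shortened two-item data; normalize_one and add_level play the roles of the two lambdas above):

```python
import pandas as pd

json_2_explode = [
    {"scalar": "43", "units": "m",
     "parameter": [{"no_1": "45", "no_2": "1038", "no_3": "356"}], "name": "Foo"},
    {"scalar": "54.1", "units": "s",
     "parameter": [{"no_1": "78", "no_2": "103", "no_3": "356"}], "name": "Yoo"},
]
df = pd.DataFrame({"col1": [11, 9], "col2": [7, 3],
                   "col_json": [json_2_explode, json_2_explode]})

def normalize_one(x):
    # unwrap the inner parameter list, then pivot to a (name, field) Series
    return (
        pd.json_normalize(x, "parameter", ["scalar", "units", "name"])
        .set_index("name")
        .stack()
    )

def add_level(d, name):
    # make a new level at the very top of the column levels
    return pd.concat([d], keys=[name], axis=1)

output = pd.concat(
    [df.drop(columns="col_json").pipe(add_level, "Main"),
     df["col_json"].apply(normalize_one)],
    axis="columns",
)

# Both halves now carry two column levels, e.g. ("Main", "col1") and ("Foo", "no_1")
```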