pythonjsonpandasnested-json

Is there a way to flatten a pandas column of nested json strings with a varying number of objects?


I have a pandas dataframe that is formatted similarly to this:

order_guid selections
ef23a23d-5e2c-42a0-952b-cf43732c55b8 [{"id":"7233","item":"Smoothie"},{"guid":"6317","item":"Juice"},{"guid":"6941","item":"Chips"}]
0fb69525-7b71-4dec-ab77-36a698d519f6 [{"id":"5425","item":"Wrap"}]
59b5f8e9-96f1-483b-a31d-659492765e06 [{"id":"7233","item":"Smoothie"},{"guid":"3803","item":"Water"},{"guid":"7961","item":"Salad"}]
d8656ba2-4076-4036-9515-6fea1f9e58d4 [{"id":"7961","item":"Salad"},{"guid":"5425","item":"Wrap"}]

What I've been trying to do is unnest the selections field such that each selection occupies its own row with the order_guid as a foreign key to a separate orders table, similar to this:

order_guid selection_id selection_item
ef23a23d-5e2c-42a0-952b-cf43732c55b8 7233 Smoothie
ef23a23d-5e2c-42a0-952b-cf43732c55b8 6317 Juice
ef23a23d-5e2c-42a0-952b-cf43732c55b8 6941 Chips
0fb69525-7b71-4dec-ab77-36a698d519f6 5425 Wrap
59b5f8e9-96f1-483b-a31d-659492765e06 7233 Smoothie
59b5f8e9-96f1-483b-a31d-659492765e06 3803 Water
59b5f8e9-96f1-483b-a31d-659492765e06 7961 Salad
d8656ba2-4076-4036-9515-6fea1f9e58d4 7961 Salad
d8656ba2-4076-4036-9515-6fea1f9e58d4 5425 Wrap

The main thing giving me trouble is the varying number of nested json elements. My initial instinct was that I could do something like to_json() ->json.loads()->json_normalize, but I can't make that work with the order_guid and the multiple selections. Looping seems silly. It feels like this is a simple enough problem that someone else should've encountered it, but this is the closest I've seen to the issue.

Is there a pandas method for this? Or something obvious I'm missing?


Solution

  • You can first use df.explode() before using json_normalize(). Hope this is what you meant.

    df = df.explode('selections').reset_index(drop=True)
    result_df = pd.concat([df['order_guid'], pd.json_normalize(df.selections)], axis=1)
    result_df['guid'] = result_df['guid'].fillna(result_df['id'])
    result_df = result_df[['order_guid', 'guid', 'item']].rename(columns={'guid': 'selection_guid', 'item': 'selection_item'})