I have a pandas dataframe that is formatted similarly to this:
order_guid | selections |
---|---|
ef23a23d-5e2c-42a0-952b-cf43732c55b8 | [{"id":"7233","item":"Smoothie"},{"guid":"6317","item":"Juice"},{"guid":"6941","item":"Chips"}] |
0fb69525-7b71-4dec-ab77-36a698d519f6 | [{"id":"5425","item":"Wrap"}] |
59b5f8e9-96f1-483b-a31d-659492765e06 | [{"id":"7233","item":"Smoothie"},{"guid":"3803","item":"Water"},{"guid":"7961","item":"Salad"}] |
d8656ba2-4076-4036-9515-6fea1f9e58d4 | [{"id":"7961","item":"Salad"},{"guid":"5425","item":"Wrap"}] |
What I've been trying to do is unnest the selections
field such that each selection occupies its own row with the order_guid
as a foreign key to a separate orders
table, similar to this:
order_guid | selection_id | selection_item |
---|---|---|
ef23a23d-5e2c-42a0-952b-cf43732c55b8 | 7233 | Smoothie |
ef23a23d-5e2c-42a0-952b-cf43732c55b8 | 6317 | Juice |
ef23a23d-5e2c-42a0-952b-cf43732c55b8 | 6941 | Chips |
0fb69525-7b71-4dec-ab77-36a698d519f6 | 5425 | Wrap |
59b5f8e9-96f1-483b-a31d-659492765e06 | 7233 | Smoothie |
59b5f8e9-96f1-483b-a31d-659492765e06 | 3803 | Water |
59b5f8e9-96f1-483b-a31d-659492765e06 | 7961 | Salad |
d8656ba2-4076-4036-9515-6fea1f9e58d4 | 7961 | Salad |
d8656ba2-4076-4036-9515-6fea1f9e58d4 | 5425 | Wrap |
The main thing giving me trouble is the varying number of nested json elements. My initial instinct was that I could do something like to_json()
->json.loads()
->json_normalize
, but I can't make that work with the order_guid
and the multiple selections. Looping seems silly. It feels like this is a simple enough problem that someone else should've encountered it, but this is the closest I've seen to the issue.
Is there a pandas method for this? Or something obvious I'm missing?
You can first use df.explode()
before using json_normalize()
. Hope this is what you meant.
df = df.explode('selections').reset_index(drop=True)
result_df = pd.concat([df['order_guid'], pd.json_normalize(df.selections)], axis=1)
result_df['guid'] = result_df['guid'].fillna(result_df['id'])
result_df = result_df[['order_guid', 'guid', 'item']].rename(columns={'guid': 'selection_guid', 'item': 'selection_item'})