A response.json()
returns a JSON in the following format:
{
"workbooks": [
{
"name": "WORKBOOK_A",
"embeddedDatasources": [
{
"upstreamTables": [
{"name": "WORKBOOK_A_TABLE_A"}]},
{
"upstreamTables": [
{"name": "WORKBOOK_A_TABLE_B"},
{"name": "WORKBOOK_A_TABLE_C"}]},
{
"upstreamTables": []}]},
{
"name": "WORKBOOK_B",
"embeddedDatasources": [
{
"upstreamTables": [
{"name": "WORKBOOK_B_TABLE_A"},
{"name": "WORKBOOK_B_TABLE_B"}]},
{
"upstreamTables": [
{"name": "WORKBOOK_B_TABLE_C"},
{"name": "WORKBOOK_B_TABLE_D"}]}]}]}
I am trying to convert it to a dataframe like this:
workbooks | upstreamTables |
---|---|
WORKBOOK_A | WORKBOOK_A_TABLE_A |
WORKBOOK_A | WORKBOOK_A_TABLE_B |
WORKBOOK_A | WORKBOOK_A_TABLE_C |
WORKBOOK_B | WORKBOOK_B_TABLE_A |
WORKBOOK_B | WORKBOOK_B_TABLE_B |
WORKBOOK_B | WORKBOOK_B_TABLE_C |
WORKBOOK_B | WORKBOOK_B_TABLE_D |
"upstreamTables": []
should be ignored in this case.
Playing around with json_normalize
df = pd.json_normalize(json_data)
didn't play out so far and extracting the data as separate dataframes and rejoining them seems too convulsive.
Here's one approach:
resp
(i.e., response.json()
) to pd.json_normalize
with both record_path
and meta
. Add meta_prefix
to avoid a ValueError: Conflicting metadata
. Cf. this post. Otherwise we would end up with 2 name
columns.df.rename
to rename the columns and re-order them.import pandas as pd
# resp = {...}
df = (pd.json_normalize(resp['workbooks'],
record_path=['embeddedDatasources', 'upstreamTables'],
meta='name',
meta_prefix='meta_'
)
.rename(columns={'name': 'upstreamTables',
'meta_name': 'workbooks'})
[['workbooks', 'upstreamTables']]
)
Output:
workbooks upstreamTables
0 WORKBOOK_A WORKBOOK_A_TABLE_A
1 WORKBOOK_A WORKBOOK_A_TABLE_B
2 WORKBOOK_A WORKBOOK_A_TABLE_C
3 WORKBOOK_B WORKBOOK_B_TABLE_A
4 WORKBOOK_B WORKBOOK_B_TABLE_B
5 WORKBOOK_B WORKBOOK_B_TABLE_C
6 WORKBOOK_B WORKBOOK_B_TABLE_D