
Python Flatten Deep Nested JSON


I have the following JSON structure:

{
  "comments_v2": [
    {
      "timestamp": 1196272984,
      "data": [
        {
          "comment": {
            "timestamp": 1196272984,
            "comment": "OSI Beach Party Weekend, CA",
            "author": "xxxx"
          }
        }
      ],
      "title": "xxxx commented on his own photo."
    },
    {
      "timestamp": 1232918783,
      "data": [
        {
          "comment": {
            "timestamp": 1232918783,
            "comment": "We'll see about that.",
            "author": "xxxx"
          }
        }
      ]
    }
  ]
}

I'm trying to flatten this JSON into a pandas dataframe and here is my solution:

import codecs
import pandas as pd

# Read file (utf-8-sig strips a leading BOM, if present)
df = pd.read_json(codecs.open(infile, "r", "utf-8-sig"))

# Normalize the top level, then expand the nested "data" list separately
df = pd.json_normalize(df["comments_v2"])
child_column = pd.json_normalize(df["data"])
child_column = pd.concat([child_column.drop([0], axis=1), child_column[0].apply(pd.Series)], axis=1)
df_merge = df.join(child_column)
df_merge.drop(["data"], axis=1, inplace=True)

The resulting dataframe is as follows:

timestamp   title                            comment.timestamp  comment.comment              comment.author  comment.group
1196272984  xxxx commented on his own photo  1196272984         OSI Beach Party Weekend, CA  XXXXX           NaN

Is there a simpler way to flatten the JSON to obtain the result shown above?

Thank you!


Solution

  • Pass record_path='data' to pd.json_normalize:

    import codecs
    import json
    import pandas as pd

    with codecs.open(infile, 'r', 'utf-8-sig') as jsonfile:
        data = json.load(jsonfile)

    # One row per element of each entry's "data" list
    df = pd.json_normalize(data['comments_v2'], record_path='data')
    

    Output:

    >>> df
       comment.timestamp              comment.comment comment.author
    0         1196272984  OSI Beach Party Weekend, CA           xxxx
    1         1232918783        We'll see about that.           xxxx
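
The output above contains only the nested comment fields. The question's target table also includes the parent-level timestamp and title, and json_normalize can carry those along through its meta argument. The following is a minimal sketch, not a definitive answer. It assumes the sample JSON from the question, held in memory as a dict rather than read from infile, and it passes errors='ignore' because the second entry has no title, which would otherwise raise a KeyError.

```python
import pandas as pd

# In-memory copy of the sample JSON from the question (assumption: same
# structure as the file read from `infile` in the original code)
data = {
    "comments_v2": [
        {
            "timestamp": 1196272984,
            "data": [{"comment": {"timestamp": 1196272984,
                                  "comment": "OSI Beach Party Weekend, CA",
                                  "author": "xxxx"}}],
            "title": "xxxx commented on his own photo.",
        },
        {
            "timestamp": 1232918783,
            "data": [{"comment": {"timestamp": 1232918783,
                                  "comment": "We'll see about that.",
                                  "author": "xxxx"}}],
        },
    ]
}

df = pd.json_normalize(
    data["comments_v2"],
    record_path="data",           # one row per element of each "data" list
    meta=["timestamp", "title"],  # parent-level fields repeated on each row
    errors="ignore",              # a missing "title" becomes NaN
)
print(df)
```

The nested columns (comment.timestamp, comment.comment, comment.author) come first, followed by timestamp and title. If a meta name collided with a record column name, json_normalize would raise a ValueError; the meta_prefix argument is one way around that.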