pythonjsonpandasnestedtweets

Reading nested JSON file in Pandas dataframe


I have a JSON file with the following structure (it's not the complete json file, but the structure is the same):

{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
..... 

//The rest of json continues with the same structure, but referenced_tweets is not always present  

My question: How can I load this data into a dataframe with these columns: type, id(referenced_tweet id), text, created_at, author_id, and id (tweet id)?

What I could do so far: I could get the following columns:

referenced_tweets text cerated_at author_id id (tweet id)
[{'type': 'xx', 'id': 'xxx'}] xxx xxxx xxxxx xxxxxxxxxxxx

Here is the code to get the above table:

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

df1 = json_normalize(data_list, 'data')
df1.head()

However, I'd like to get the type and id (in referenced_tweets) in separate columns and I could get the following so far:

type id (referenced_tweet id)
xxxx xxxxxxxxxxxxxxxxxxxxxxx

and here is the code to get the above table:

df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()

What is the problem? I'd like to get everything in one table, i.e., a table similar to the first one here but with type and id in separate columns (like the 2nd table). So, the final columns should be : type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).


Solution

  • import pandas as pd
    
    
    with open('Test_SampleRetweets.json') as json_file:
        raw_data = json.load(json_file)
    
    
    data = []
    for item in raw_data["data"]:
        item["tweet_id"] = item["id"]
        item.update(item["referenced_tweets"][0])
        del item["referenced_tweets"]
        data.append(item)
    
    
    df1 = pd.DataFrame(data)
    print(df1.head())