I am trying to import a deeply nested JSON into pandas dataframe. Here is the structure of the JSON file (this is only the first record (retweets[:1]):
[{'lang': 'en',
'author_id': '1076979440372965377',
'reply_settings': 'everyone',
'entities': {'mentions': [{'start': 3,
'end': 17,
'username': 'Terry81987010',
'url': '',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
{'start': 19,
'end': 32,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
'conversation_id': '1253462541881106433',
'created_at': '2020-04-23T23:15:32.000Z',
'id': '1253462541881106433',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'retweeted',
'id': '1253052684489437184',
'in_reply_to_user_id': '91882544',
'attachments': {'media_keys': ['3_1253052312144293888',
'3_1253052620937277442'],
'media': [{}, {}]},
'entities': {'annotations': [{'start': 126,
'end': 128,
'probability': 0.514,
'type': 'Organization',
'normalized_text': 'CDC'},
{'start': 145,
'end': 146,
'probability': 0.5139,
'type': 'Place',
'normalized_text': 'NY'}],
'mentions': [{'start': 0,
'end': 13,
'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
'urls': [{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
{'start': 187,
'end': 210,
'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
'lang': 'en',
'author_id': '1091471553437593605',
'reply_settings': 'everyone',
'conversation_id': '1253050942716551168',
'created_at': '2020-04-22T20:06:55.000Z',
'possibly_sensitive': False,
'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
'public_metrics': {'retweet_count': 208,
'reply_count': 57,
'like_count': 402,
'quote_count': 38},
'source': 'Twitter Web App',
'text': "@DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death $60,000",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'Terry81987010',
'location': 'Florida',
'entities': {'description': {'hashtags': [{'start': 29,
'end': 32,
'tag': '2A'}]}},
'created_at': '2019-02-01T23:01:11.000Z',
'protected': False,
'public_metrics': {'followers_count': 520,
'following_count': 567,
'tweet_count': 34376,
'listed_count': 1},
'name': "Terry's Take",
'verified': False,
'id': '1091471553437593605',
'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
'in_reply_to_user': {'username': 'DineshDSouza',
'location': 'United States',
'entities': {'url': {'urls': [{'start': 0,
'end': 23,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
'description': {'urls': [{'start': 80,
'end': 103,
'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
'created_at': '2009-11-22T22:32:41.000Z',
'protected': False,
'public_metrics': {'followers_count': 1748832,
'following_count': 5355,
'tweet_count': 65674,
'listed_count': 6966},
'name': "Dinesh D'Souza",
'verified': True,
'pinned_tweet_id': '1393309917239562241',
'id': '91882544',
'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
'public_metrics': {'retweet_count': 208,
'reply_count': 0,
'like_count': 0,
'quote_count': 0},
'source': 'Twitter for iPhone',
'text': "RT @Terry81987010: @DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
'context_annotations': [{'domain': {'id': '10',
'name': 'Person',
'description': 'Named people in the world like Nelson Mandela'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
{'domain': {'id': '35',
'name': 'Politician',
'description': 'Politicians in the world, like Joe Biden'},
'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
'author': {'url': '',
'username': 'set1952',
'location': 'Etats-Unis',
'created_at': '2018-12-23T23:14:42.000Z',
'protected': False,
'public_metrics': {'followers_count': 103,
'following_count': 44,
'tweet_count': 44803,
'listed_count': 0},
'name': 'SunSet1952',
'verified': False,
'id': '1076979440372965377',
'description': '',
'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
'__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
'version': '2.0.8',
'retrieved_at': '2021-05-17T17:13:17+00:00'}},
Here is my code:
retweets = []
for line in open('Data/usersRetweetsFlatten_sample.json', 'r'):
retweets.append(json.loads(line))
df = json_normalize(
retweets, 'referenced_tweets', ['referenced_tweets', 'type'],
meta_prefix= ".",
errors='ignore'
)
df[['author_id', 'type', '.type', 'id', 'in_reply_to_user_id', 'referenced_tweets']].head()
Here is the resulting dataframe:
As you can see, the column referenced_tweets
is not flattened yet (please note that there are two different referenced_tweets
arrays in my JSON file: one is in a deeper level insdide the other "referenced_tweets"). For example, the one at the higher level return this:
>>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"
and the one in the deeper level return this:
>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'
QUESTION: I was wondering how I can flatten the deeper referenced_tweets
. I want to have two separate columns as referenced_tweets.type
and referenced_tweets.id
, where the value of the column referenced_tweets.type
in the above example should be replied_to
.
I think the issue here is that your data is double nested... there is a key referenced_tweets
within referenced_tweets
.
import json
from pandas import json_normalize
with open("flatten.json", "r") as file:
data = json.load(file)
df = json_normalize(
data,
record_path=["referenced_tweets", "referenced_tweets"],
meta=[
"author_id",
# ["author", "username"], # not possible
# "author", # possible but not useful
["referenced_tweets", "id"],
["referenced_tweets", "type"],
["referenced_tweets", "in_reply_to_user_id"],
["referenced_tweets", "in_reply_to_user", "username"],
]
)
print(df)
See also: https://stackoverflow.com/a/37668569/42659
Note: Above code will fail if second nested referenced_tweet
is missing.
Edit: Alternatively you could further normalize your data (which you already partly normalized with your code) in your question with an additional manual iteration. See example below. Note: Code is not optimized and may be slow depending on the amount of data.
# load your `data` with `json.load()` or `json.loads()`
df = json_normalize(
data,
record_path="referenced_tweets",
meta=["referenced_tweets", "type"],
meta_prefix= ".",
errors="ignore",
)
columns = [*df.columns, "_type", "_id"]
normalized_data = []
def append(row, type, id):
normalized_row = [*row.to_list(), type, id]
normalized_data.append(normalized_row)
for _, row in df.iterrows():
# a list/array is expected
if type(row["referenced_tweets"]) is list:
for tweet in row["referenced_tweets"]:
append(row, tweet["type"], tweet["id"])
# if empty list
else:
append(row, None, None)
else:
append(row, None, None)
enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df.drop("referenced_tweets", 1)
print(enhanced_df)
Edit 2: referenced_tweets
should be an array. However, if there is no referenced tweet, the Twitter API seems to omit referenced_tweets
completely. In that case, the cell value is NaN
(float) instead of an empty list. I updated the code above to take that into account.