I have a pretty crazy dictionary that I'm trying to parse out into a pandas dataframe. Here is a smaller verison of what the dictionary looks like:
import datetime
from decimal import *
test_dict = [{'record_id': '43bbdfbf',
'date': datetime.date(2023, 3, 25),
'person': {
'id': '123abc',
'name': 'Person1'
},
'venue': {
'id': '5bd6c74c',
'name': 'Place1',
'city': {
'id': '3448439',
'name': 'São Paulo',
'state': 'São Paulo',
'state_code': 'SP',
'coords': {'lat': Decimal('-23.5475'), 'long': Decimal('-46.63611111')},
'country': {'code': 'BR', 'name': 'Brazil'}
},
},
'thing_lists': {'thing_list': [
{'song': [
{'name': 'Thing1','info': None,'dup': None},
{'name': 'Thing2', 'info': None, 'dup': None},
{'name': 'Thing3', 'info': None, 'dup': None},
{'name': 'Thing4', 'info': None, 'dup': None}],
'extra': None},
{'song': [
{'name': 'ExtraThing1','info': None,'dup': None},
{'name': 'ExtraThing2', 'info': None, 'dup': None}],
'extra': 1
}]}}]
Here's a function I started building to parse out pieces of information from the dictionary:
def extract_values(dictionary):
record_id = dictionary[0]['record_id'],
date = dictionary[0]['date'],
country = dictionary[0]['venue']['city']['country']['name']
return record_id, date, venue, city, lat, long, country
Here's the piece where I attempt to pull out the pieces into a dataframe.
import pandas as pd
df = pd.DataFrame(extract_values(test_dict)).transpose()
df.rename(
columns={
df.columns[0]: 'record_id',
df.columns[1]: 'date',
df.columns[3]: 'city',
df.columns[6]: 'country'
},
inplace=True
)
As you can see it mostly works except for string fields which get split out where each row gets a single character. I'm not sure how to resolve this issue. However, it seems if the last field I pull isn't a string - then it gets squished back into place. Is there a way to push the strings together manually so I don't have to rely on the data type of the final field?
Also, the final few fields appear to be tricky to pull. Ideally, I would like my final dataframe to look like the following:
RecordID Date City Country ThingName Dup Extra
43bbdfbf 2023-03-25 São Paulo Brazil Thing1 None None
43bbdfbf 2023-03-25 São Paulo Brazil Thing2 None None
43bbdfbf 2023-03-25 São Paulo Brazil Thing3 None None
43bbdfbf 2023-03-25 São Paulo Brazil Thing4 None None
43bbdfbf 2023-03-25 São Paulo Brazil ExtraThing1 None 1
43bbdfbf 2023-03-25 São Paulo Brazil ExtraThing2 None 1
Can someone help point me in the right direction for how to correctly parse this dictionary?
I don't see an easy way around this other than a massively nested loop to extract all the values:
def extract_values(data):
records = []
for record in data:
for thing in record['thing_lists']['thing_list']:
for song in thing['song']:
records.append({
'RecordID' : record['record_id'],
'Date': record['date'],
'City': record['venue']['city']['name'],
'Country': record['venue']['city']['country']['name'],
'ThingName': song['name'],
'Dup': song['dup'],
'Extra': thing['extra']
})
return records
records = extract_values(test_dict)
df = pd.DataFrame(records)
Output:
RecordID Date City Country ThingName Dup Extra
0 43bbdfbf 2023-03-25 São Paulo Brazil Thing1 None NaN
1 43bbdfbf 2023-03-25 São Paulo Brazil Thing2 None NaN
2 43bbdfbf 2023-03-25 São Paulo Brazil Thing3 None NaN
3 43bbdfbf 2023-03-25 São Paulo Brazil Thing4 None NaN
4 43bbdfbf 2023-03-25 São Paulo Brazil ExtraThing1 None 1.0
5 43bbdfbf 2023-03-25 São Paulo Brazil ExtraThing2 None 1.0