pythonjsonpandasdictionarydatetime

How to Parse Out Nested Dictionaries into Pandas DataFrame


I have a pretty crazy dictionary that I'm trying to parse out into a pandas dataframe. Here is a smaller verison of what the dictionary looks like:

import datetime
from decimal import *

test_dict = [{'record_id': '43bbdfbf',
  'date': datetime.date(2023, 3, 25),
  'person': {
      'id': '123abc',
      'name': 'Person1'
  },
  'venue': {
      'id': '5bd6c74c',
      'name': 'Place1',
      'city': {
          'id': '3448439',
          'name': 'São Paulo',
          'state': 'São Paulo',
          'state_code': 'SP',
          'coords': {'lat': Decimal('-23.5475'), 'long': Decimal('-46.63611111')},
          'country': {'code': 'BR', 'name': 'Brazil'}
      },
   },
  'thing_lists': {'thing_list': [
      {'song': [
          {'name': 'Thing1','info': None,'dup': None},
          {'name': 'Thing2', 'info': None, 'dup': None},
          {'name': 'Thing3', 'info': None, 'dup': None},
          {'name': 'Thing4', 'info': None, 'dup': None}],
         'extra': None},
     {'song': [
          {'name': 'ExtraThing1','info': None,'dup': None},
          {'name': 'ExtraThing2', 'info': None, 'dup': None}],
         'extra': 1
     }]}}]

Here's a function I started building to parse out pieces of information from the dictionary:

def extract_values(dictionary):
    record_id = dictionary[0]['record_id'],
    date = dictionary[0]['date'],
    country = dictionary[0]['venue']['city']['country']['name']
    
    return record_id, date, venue, city, lat, long, country

Here's the piece where I attempt to pull out the pieces into a dataframe.

import pandas as pd
df = pd.DataFrame(extract_values(test_dict)).transpose()
df.rename(
    columns={
        df.columns[0]: 'record_id',
        df.columns[1]: 'date',
        df.columns[3]: 'city',
        df.columns[6]: 'country'
    }, 
    inplace=True
)

As you can see it mostly works except for string fields which get split out where each row gets a single character. I'm not sure how to resolve this issue. However, it seems if the last field I pull isn't a string - then it gets squished back into place. Is there a way to push the strings together manually so I don't have to rely on the data type of the final field?

Also, the final few fields appear to be tricky to pull. Ideally, I would like my final dataframe to look like the following:

RecordID Date       City      Country ThingName    Dup   Extra
43bbdfbf 2023-03-25 São Paulo Brazil  Thing1       None  None
43bbdfbf 2023-03-25 São Paulo Brazil  Thing2       None  None
43bbdfbf 2023-03-25 São Paulo Brazil  Thing3       None  None 
43bbdfbf 2023-03-25 São Paulo Brazil  Thing4       None  None
43bbdfbf 2023-03-25 São Paulo Brazil  ExtraThing1  None  1
43bbdfbf 2023-03-25 São Paulo Brazil  ExtraThing2  None  1

Can someone help point me in the right direction for how to correctly parse this dictionary?


Solution

  • I don't see an easy way around this other than a massively nested loop to extract all the values:

    def extract_values(data):
        records = []
        for record in data:
            for thing in record['thing_lists']['thing_list']:
                for song in thing['song']:
                    records.append({ 
                        'RecordID' : record['record_id'],
                        'Date': record['date'],
                        'City': record['venue']['city']['name'],
                        'Country': record['venue']['city']['country']['name'],
                        'ThingName': song['name'],
                        'Dup': song['dup'],
                        'Extra': thing['extra']
                    })
        return records
    
    records = extract_values(test_dict)
    df = pd.DataFrame(records)
    

    Output:

       RecordID        Date       City Country    ThingName   Dup  Extra
    0  43bbdfbf  2023-03-25  São Paulo  Brazil       Thing1  None    NaN
    1  43bbdfbf  2023-03-25  São Paulo  Brazil       Thing2  None    NaN
    2  43bbdfbf  2023-03-25  São Paulo  Brazil       Thing3  None    NaN
    3  43bbdfbf  2023-03-25  São Paulo  Brazil       Thing4  None    NaN
    4  43bbdfbf  2023-03-25  São Paulo  Brazil  ExtraThing1  None    1.0
    5  43bbdfbf  2023-03-25  São Paulo  Brazil  ExtraThing2  None    1.0