pythonjsondataframeaws-lambdastruct

Converting components of Array Struct as columns in python


I'm trying to unnest array struct column in python. Datasource is a json file. I Would like to split tag column such that each key becomes a column and each value becomes its respective value for the entry. If a particular entry doesn't have a tag then corresponding value should be blank. Any suggestions on how to achieve this will be very helpful, TIA!

Current data structure:

{"Account ID": "101356", "AccountName": "Account1", "Tags": [{"Key": "AppGroupEmail", "Value": "testteam"}, {"Key": "SNOW", "Value": "TASK21002089571"}, {"Key": "ClientID", "Value": "220001"}]}
{"Account ID": "101357", "AccountName": "Account2", "Tags": [{"Key": "ClientID", "Value": "220007"}, {"Key": "BuildTicket", "Value": "TM21001412480"}, {"Key": "AccountType", "Value": "Dev"}]}

Expected outcome:

Table of data


Solution

  • This code re-writes what appears to be JSON Lines-formatted data file to a CSV with the columns described:

    import json
    import csv
    
    # Assumptions:
    # 1. "Account ID", "AccountName", and "Tags" are keys in every JSON object.
    # 2. "Tags" keys are unknown.
    # 3. Input file can be completely loaded into memory in order to collect
    #    other column names from "Tags".
    
    with open('input.jsonl') as file:
        columns = ['Account ID', 'AccountName']
        records = []
        for line in file:
            data = json.loads(line)
            for d in data['Tags'].copy():  # original data is modified with new columns.
                key, value = d['Key'], d['Value']
                data[key] = value  # Add "Tag" keys to data dict
                if key not in columns:
                    columns.append(key)  # collect "Tag" keys as column headers in order found.
            del data['Tags']      # Unneeded now
            records.append(data)  # save the updated record
    
    with open('output.csv', 'w', encoding='ascii', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)
    

    output.csv:

    Account ID,AccountName,AppGroupEmail,SNOW,ClientID,BuildTicket,AccountType
    101356,Account1,testteam,TASK21002089571,220001,,
    101357,Account2,,,220007,TM21001412480,Dev
    

    Opened in Excel:

    Table matching requirements