python pandas dictionary-comprehension

Creating columns from a column that contains a list of dictionaries


I have a dataframe with a column in which each cell holds a list of dictionaries like this:

[{'MetricName': 'test:mean_wQuantileLoss',
  'Value': 1.0935583114624023,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())},
 {'MetricName': 'train:loss:batch',
  'Value': 3.0625627040863037,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
 {'MetricName': 'train:progress',
  'Value': 100.0,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
 {'MetricName': 'train:loss',
  'Value': 3.2942464351654053,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
 {'MetricName': 'train:final_loss',
  'Value': 3.2942464351654053,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
 {'MetricName': 'train:throughput',
  'Value': 385.56353759765625,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
 {'MetricName': 'test:RMSE',
  'Value': 22.101428985595703,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())},
 {'MetricName': 'ObjectiveMetric',
  'Value': 22.101428985595703,
  'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}]

I want to create a column for each MetricName, holding its Value. I also want to keep the other columns in the dataframe intact. How do I achieve this?

Here is a sample dataframe:

import pandas as pd

data = {'TrainingJobName': ['Training_JOB_NAME1'],
       'TrainingJobArn': ["Blahblah"],
       'FinalMetricDataList': ["[{'MetricName': 'test:mean_wQuantileLoss', 'Value': 1.0935583114624023, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}, {'MetricName': 'train:loss:batch', 'Value': 3.0625627040863037, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:progress', 'Value': 100.0, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:final_loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:throughput', 'Value': 385.56353759765625, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'test:RMSE', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}, {'MetricName': 'ObjectiveMetric', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}]"]}
df_sample = pd.DataFrame(data=data)
df_sample.head()

Solution

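  • You can use DataFrame.explode to convert the list of dictionaries to rows, json_normalize to expand each dictionary into columns, DataFrame.pivot to turn each MetricName into a column holding its Value, and DataFrame.join to append the result to the existing data. This assumes FinalMetricDataList holds actual lists, as in the sample below, rather than their string representation: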

    import datetime
    import pandas as pd
    
    data = {'TrainingJobName': ['Training_JOB_NAME1'],
           'TrainingJobArn': ["Blahblah"],
           'FinalMetricDataList': [[{'MetricName': 'test:mean_wQuantileLoss', 'Value': 1.0935583114624023, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}, {'MetricName': 'train:loss:batch', 'Value': 3.0625627040863037, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:progress', 'Value': 100.0, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:final_loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:throughput', 'Value': 385.56353759765625, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'test:RMSE', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}, {'MetricName': 'ObjectiveMetric', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}]]}
    df_sample = pd.DataFrame(data=data)
    #print (df_sample)
    
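    # One row per dictionary; the original row label is repeated in the index
    df = df_sample.explode('FinalMetricDataList')
    # Expand each dictionary into columns and keep the original row label as idx
    out = pd.json_normalize(df['FinalMetricDataList']).assign(idx = df.index.tolist())
    
    # One column per MetricName, then attach to the original rows by index
    out = df_sample.join(out.pivot(index='idx', columns='MetricName', values='Value'))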
    print (out)
          TrainingJobName TrainingJobArn  \
    0  Training_JOB_NAME1       Blahblah   
    
                                     FinalMetricDataList  ObjectiveMetric  \
    0  [{'MetricName': 'test:mean_wQuantileLoss', 'Va...        22.101429   
    
       test:RMSE  test:mean_wQuantileLoss  train:final_loss  train:loss  \
    0  22.101429                 1.093558          3.294246    3.294246   
    
       train:loss:batch  train:progress  train:throughput  
    0          3.062563           100.0        385.563538
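
As an alternative to the explode/json_normalize/pivot chain above, the same wide layout can be sketched with a dictionary comprehension per row. This is not part of the original answer. It assumes every cell of FinalMetricDataList is a real list of dictionaries with MetricName and Value keys. The sample is shortened to two metrics, and the name metrics is only illustrative.

```python
import datetime

import pandas as pd

# Shortened version of the sample data (two metrics instead of eight).
df_sample = pd.DataFrame({
    "TrainingJobName": ["Training_JOB_NAME1"],
    "TrainingJobArn": ["Blahblah"],
    "FinalMetricDataList": [[
        {"MetricName": "train:loss", "Value": 3.2942464351654053,
         "Timestamp": datetime.datetime(2022, 10, 20, 7, 44, 37)},
        {"MetricName": "test:RMSE", "Value": 22.101428985595703,
         "Timestamp": datetime.datetime(2022, 10, 20, 7, 45, 6)},
    ]],
})

# Build one {MetricName: Value} dict per row.
# Reusing df_sample.index keeps the join below aligned row by row.
metrics = pd.DataFrame(
    [{m["MetricName"]: m["Value"] for m in row}
     for row in df_sample["FinalMetricDataList"]],
    index=df_sample.index,
)

# Append the metric columns to the untouched original columns.
out = df_sample.join(metrics)
print(out[["TrainingJobName", "train:loss", "test:RMSE"]])
```

One difference worth knowing: if a row repeats a MetricName, the comprehension silently keeps the last Value, whereas DataFrame.pivot raises a ValueError on duplicate index/column pairs.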