I have a dataframe with a column that holds a list of dictionaries. Each list looks like this:
[{'MetricName': 'test:mean_wQuantileLoss',
'Value': 1.0935583114624023,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())},
{'MetricName': 'train:loss:batch',
'Value': 3.0625627040863037,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
{'MetricName': 'train:progress',
'Value': 100.0,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
{'MetricName': 'train:loss',
'Value': 3.2942464351654053,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
{'MetricName': 'train:final_loss',
'Value': 3.2942464351654053,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
{'MetricName': 'train:throughput',
'Value': 385.56353759765625,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())},
{'MetricName': 'test:RMSE',
'Value': 22.101428985595703,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())},
{'MetricName': 'ObjectiveMetric',
'Value': 22.101428985595703,
'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}]
I want to create a column for each MetricName that holds its Value, while keeping the other columns in the dataframe intact. How do I achieve this?
Here is a sample dataframe:
data = {'TrainingJobName': ['Training_JOB_NAME1'],
'TrainingJobArn': ["Blahblah"],
'FinalMetricDataList': ["[{'MetricName': 'test:mean_wQuantileLoss', 'Value': 1.0935583114624023, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}, {'MetricName': 'train:loss:batch', 'Value': 3.0625627040863037, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:progress', 'Value': 100.0, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:final_loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'train:throughput', 'Value': 385.56353759765625, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37, tzinfo=tzlocal())}, {'MetricName': 'test:RMSE', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}, {'MetricName': 'ObjectiveMetric', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6, tzinfo=tzlocal())}]"]}
df_sample = pd.DataFrame(data=data)
df_sample.head()
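You can use DataFrame.explode to convert the list of dictionaries to rows, json_normalize to expand the dictionaries into new columns, DataFrame.pivot to turn each MetricName into its own column, and DataFrame.join to append the result to the existing data. Note that this assumes FinalMetricDataList holds actual lists of dictionaries, not their string representation as in the question's sample: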
import datetime
import pandas as pd
data = {'TrainingJobName': ['Training_JOB_NAME1'],
'TrainingJobArn': ["Blahblah"],
'FinalMetricDataList': [[{'MetricName': 'test:mean_wQuantileLoss', 'Value': 1.0935583114624023, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}, {'MetricName': 'train:loss:batch', 'Value': 3.0625627040863037, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:progress', 'Value': 100.0, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:final_loss', 'Value': 3.2942464351654053, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'train:throughput', 'Value': 385.56353759765625, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 44, 37)}, {'MetricName': 'test:RMSE', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}, {'MetricName': 'ObjectiveMetric', 'Value': 22.101428985595703, 'Timestamp': datetime.datetime(2022, 10, 20, 7, 45, 6)}]]}
df_sample = pd.DataFrame(data=data)
#print (df_sample)
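# one row per dictionary; the original row label is repeated in the index
df = df_sample.explode('FinalMetricDataList')
# expand each dictionary into columns and keep the original row label as idx
out = pd.json_normalize(df['FinalMetricDataList']).assign(idx = df.index.tolist())
# one column per MetricName, then attach to the original columns by index
out = df_sample.join(out.pivot(index='idx', columns='MetricName', values='Value'))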
print (out)
      TrainingJobName TrainingJobArn  \
0  Training_JOB_NAME1       Blahblah

                                 FinalMetricDataList  ObjectiveMetric  \
0  [{'MetricName': 'test:mean_wQuantileLoss', 'Va...        22.101429

   test:RMSE  test:mean_wQuantileLoss  train:final_loss  train:loss  \
0  22.101429                 1.093558          3.294246    3.294246

   train:loss:batch  train:progress  train:throughput
0          3.062563           100.0        385.563538
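
To see what each step contributes, here is a minimal sketch of the same explode / json_normalize / pivot / join chain on a smaller frame with two rows. The job names and metric values are made up for illustration, and the names df_small, exploded, long_df, wide and result are hypothetical. It passes a plain list of dictionaries to json_normalize and drops the original list column before joining, which is optional.

```python
import pandas as pd

# Hypothetical two-job frame; names and values are invented for illustration.
df_small = pd.DataFrame({
    "TrainingJobName": ["job_a", "job_b"],
    "FinalMetricDataList": [
        [{"MetricName": "train:loss", "Value": 3.29},
         {"MetricName": "test:RMSE", "Value": 22.10}],
        [{"MetricName": "train:loss", "Value": 2.75},
         {"MetricName": "test:RMSE", "Value": 19.40}],
    ],
})

# Step 1: one row per dictionary; the original row label repeats in the index.
exploded = df_small.explode("FinalMetricDataList")

# Step 2: expand each dictionary into columns and record its source row label.
long_df = pd.json_normalize(exploded["FinalMetricDataList"].tolist())
long_df["idx"] = exploded.index.to_numpy()

# Step 3: one column per metric name, one row per original row label.
wide = long_df.pivot(index="idx", columns="MetricName", values="Value")

# Step 4: attach the metric columns to the remaining original columns by index.
result = df_small.drop(columns="FinalMetricDataList").join(wide)
print(result)
```

If a single row can contain the same MetricName more than once, pivot raises a ValueError because of the duplicate entries; in that case something like pivot_table with an explicit aggregation may be a better fit.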