I am trying to get data from a YAML file into a Pandas DataFrame. Take the following example data.yml:
---
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
...
The desired DataFrame would look like this:
     doc reviews.reviewer reviews.stars
0  Book1             Paul             5
1  Book1              Sam             2
2  Book2             John             4
3  Book2              Sam             3
4  Book2             Pete             2
I've tried feeding the YAML data to Pandas in different ways (for example, with open('data.yml') as f: data = pd.DataFrame(yaml.load(f))), but the cells always contain the nested dicts. This solution works for general JSON data, but it's quite a bit of code, and it seems like a simpler solution for YAML might exist.
Is there a built-in way to denormalize YAML for conversion to a Pandas DataFrame in this way?
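You should use json_normalize to flatten the nested data after YAML loads it (since pandas 1.0 it is available as pd.json_normalize; older versions expose it as pd.io.json.json_normalize):
pd.json_normalize(yaml.safe_load(f), 'reviews', 'doc')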
  reviewer stars    doc
0     Paul     5  Book1
1      Sam     2  Book1
2     John     4  Book2
3      Sam     3  Book2
4     Pete     2  Book2
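If you want the exact column names and order from the question, something like the following should work. This is a minimal sketch, assuming pandas 1.0 or later and PyYAML are installed. The YAML is inlined as a string (YAML_TEXT is just a name made up for this example) so that it runs without a data.yml file, and record_prefix is used to add the "reviews." prefix to the columns that come from the nested list.

```python
import pandas as pd
import yaml

# Inline copy of the question's data.yml (hypothetical name YAML_TEXT),
# so the sketch runs without needing the file on disk.
YAML_TEXT = """
---
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
...
"""

# safe_load parses the YAML into plain Python lists and dicts.
records = yaml.safe_load(YAML_TEXT)

# One output row per entry in each "reviews" list; "doc" is repeated
# on every row as metadata. record_prefix renames the review columns.
df = pd.json_normalize(
    records,
    record_path="reviews",
    meta="doc",
    record_prefix="reviews.",
)

# json_normalize puts the meta column last, so reorder to match the question.
df = df[["doc", "reviews.reviewer", "reviews.stars"]]
print(df)
```

Note that the stars values stay strings because they are quoted in the YAML; df["reviews.stars"].astype(int) would convert them if you need numbers. With a real file, replace yaml.safe_load(YAML_TEXT) with yaml.safe_load(f) inside a with open('data.yml') as f: block.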