python performance

What is the fastest way to read in a large YAML file containing lists of lists?


I have a number of YAML files to read in, each containing a list of lists. Here is a way to make some example data:

from time import time
import random
import yaml

# First make a list of lists
N = 2**17
lol = []
for _ in range(N):
    lol.append([random.uniform(0, 2) for _ in range(10)])

# Write the list of lists to a yaml file
with open('data.yml', 'w') as outfile:
    yaml.dump(lol, outfile, default_flow_style=True)

I want to read them in as quickly as possible. Unfortunately, PyYAML is slow.

# Now time how long it takes to read it back in
t = time()
with open("data.yml", "r") as f:
    lol = yaml.safe_load(f)
    print(f"Reading took {round(time()-t, 2)} seconds")

This takes over 60 seconds for me. The file is 27 MB in size.

Is there a faster way to read in a YAML file of exactly this format?


Solution

  • YAML (as of version 1.2) is designed as a superset of JSON, and your data, a list of lists of numbers written in flow style, happens to be valid JSON as well.

    Thus, using json.load() seems to be the simplest way:

    from time import perf_counter
    import json
    
    t = perf_counter()
    
    with open("data.yml", "r") as f:
        lol = json.load(f)
        print(perf_counter() - t)
    
    $ python read.py
    0.44149563400003444
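
If some of the files might not be valid JSON (for example, files written in block style, or files containing values such as .inf or .nan), one option is to try the JSON parser first and fall back to PyYAML only when it fails. The following is a minimal sketch, not a tested benchmark: load_list_of_lists is a hypothetical helper name, the fallback assumes PyYAML is installed, and yaml.CSafeLoader is only present when PyYAML was built with the LibYAML bindings.

```python
import json


def load_list_of_lists(path):
    """Load a YAML file, trying the much faster JSON parser first.

    Hypothetical helper: assumes the file is usually flow-style YAML
    that is also valid JSON, as in the example data above.
    """
    with open(path, "r") as f:
        text = f.read()

    try:
        # Fast path: the text happens to be valid JSON.
        return json.loads(text)
    except json.JSONDecodeError:
        # Slow path: genuine YAML syntax. PyYAML is imported lazily so
        # the fast path works even where it is not installed.
        import yaml

        # Prefer the C-accelerated loader when this PyYAML build has it;
        # otherwise fall back to the pure-Python SafeLoader.
        loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
        return yaml.load(text, Loader=loader)
```

Calling load_list_of_lists("data.yml") on the example file should take the JSON path. The fallback only costs extra time for files that the JSON parser rejects.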