python json serialization

How can I lazily read multiple JSON values from a file/stream in Python?


I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.

At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.
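
For reference, that line-per-object workaround amounts to roughly the following. This is only a minimal sketch: `iter_json_lines` is a made-up name, and it assumes every value fits on exactly one line.

```python
import json


def iter_json_lines(stream):
    """Yield one JSON value per non-blank line (hypothetical helper)."""
    for line in stream:
        if line.strip():  # skip blank lines between values
            yield json.loads(line)
```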

Example Use

example.py

import my_json as json
import sys

for o in json.iterload(sys.stdin):
    print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

example session

$ python3.2 example.py < in.txt
Working on a dict
Working on a int
Working on a int
Working on a list
Working on a int
Working on a int
Working on a int

Solution

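  • Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly: when json.load() finds a second value after the first, it raises an "Extra data" error whose pos attribute is the offset at which that extra data begins. The only limitation is the file must be seekable. Note that each attempt still reads the rest of the file, so this is lazy in what it yields, not in how much it reads.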

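    def stream_read_json(fn):
        import json
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # Succeeds only when exactly one value is left in the file.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    if e.msg != 'Extra data':
                        raise  # genuinely malformed JSON, not a second value
                    # e.pos is a character offset; using it with seek() assumes
                    # it equals the file offset (true for ASCII-only data).
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj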
    

    Edit: just noticed that this will only work for Python >=3.5, where json.JSONDecodeError was introduced. Earlier versions raise a plain ValueError, and you have to parse the position out of the error message, e.g.

    def stream_read_json(fn):
        import json
        import re
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    obj = json.load(f)
                    yield obj
                    return
                except ValueError as e:
                    # Raw string so that \d and \( reach the regex engine intact.
                    m = re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                 e.args[0])
                    if m is None:
                        raise  # some other parse error; nothing to recover
                    end_pos = int(m.group(1))
                    f.seek(start_pos)
                    json_str = f.read(end_pos)
                    obj = json.loads(json_str)
                    start_pos += end_pos
                    yield obj
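
If the input is not seekable (for example sys.stdin fed from a pipe), a different standard-library technique may help: json.JSONDecoder.raw_decode() parses one value from the start of a string and returns the value together with the index at which it stopped. The sketch below is not the approach from the answer above. It is a minimal illustration that buffers chunks and retries, and it assumes Python >=3.5 and scalar values separated by whitespace. The name `iterload` only mirrors the interface asked for in the question, and `chunk_size` is an illustrative parameter.

```python
import json


def iterload(stream, chunk_size=4096):
    """Yield JSON values from a text stream one at a time (hypothetical helper)."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip()  # raw_decode() does not skip leading whitespace
        if not buf and eof:
            return
        if buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # malformed or truncated input
                # Otherwise the value is probably incomplete: read more below.
            else:
                # A number may be cut off at a chunk boundary ("12" of "123",
                # "1" of "1.5"), so only trust it once something else follows.
                is_number = isinstance(obj, (int, float)) and not isinstance(obj, bool)
                cut_off = is_number and (end == len(buf) or buf[end] in ".eE")
                if eof or not cut_off:
                    yield obj
                    buf = buf[end:]
                    continue
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True
```

With this, the loop from the question could be written as `for o in iterload(sys.stdin): ...`. The trade-off is that an incomplete value is re-parsed from its start each time a new chunk arrives, so a single very large value costs more than one pass. A smaller `chunk_size` should make values available sooner on a slow pipe.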