
Python ijson - nested parsing


I'm working with a JSON web response that looks like this (simplified; I can't change the format):

[
   { "type": "0","key1": 3, "key2": 5},
   { "type": "1","key3": "a", "key4": "b"},
   { "type": "2", "data": [<very big array here>] }
]

I want to do three things:

  1. Inspect the first two objects without reading everything into memory. I can do this with ijson:

parsed = ijson.items(res.raw, 'item')
next(parsed) # first item
next(parsed) # second item

  2. Inspect the third object without loading all of it into memory. If I call next(parsed) again, the entire "data" array is read into memory and the whole object is turned into a dict, which I want to avoid.

  3. Inspect the data array without loading all of it into memory. If I didn't care about the other keys, I could do this:

parsed = ijson.items(res.raw, 'item.data.item') # iterator over data's items

The problem is, I need to do all of these on the same stream.

Ideally, I would receive the third object as a file-like object that I could pass to ijson again, but that seems to be out of scope for its API.

I'm also fine with replacing ijson with a library that can do this better.


Solution

  • You need to use ijson's event interception mechanism. Go one level down in the parsing logic: iterate over the low-level events from ijson.parse until you reach the big array, then pass the same, partly consumed event iterator to ijson.items, which continues from that point. The example below uses a bytes literal instead of a network stream, but it should illustrate the point:

    import ijson
    
    s = b'''[
       { "type": "0","key1": 3, "key2": 5},
       { "type": "1","key3": "a", "key4": "b"},
       { "type": "2", "data": [1, 2, 3] }
    ]'''
    parse_events = ijson.parse(s)
    while True:
        # ijson.parse yields (prefix, event, value) tuples
        prefix, event, value = next(parse_events)
        # inspect prefix, event and value here, until...
        if event == 'map_key' and value == 'data':
            break
    # hand the remaining events to ijson.items
    for element in ijson.items(parse_events, 'item.data.item'):
        print(element)