
ijson: How to use ijson to retrieve a dict/list element (from a file or from a string)?


I am trying to use ijson to retrieve an element from a JSON dict object.

The JSON string is stored in a file, and it is the only content of that file:

{"categoryTreeId":"0","categoryTreeVersion":"127","categoryAspects":[1,2,3]}

(This string is heavily simplified; the real one is over 2GB long.)

I need help doing the following:

1/ Open that file

2/ Use ijson to load that JSON data into some object

3/ Retrieve the list "[1,2,3]" from that object

Why not just use the following simple code?

import json
my_json = json.loads('{"categoryTreeId":"0","categoryTreeVersion":"127","categoryAspects":[1,2,3]}')
my_list = my_json['categoryAspects']

Well, you have to imagine that this "[1,2,3]" list is in fact over 2GB long, so using json.loads() will not work (it would just crash).

I tried a lot of combinations (A LOT) and they all failed. Here are some examples of the things I tried:

ij = ijson.items(fd,'') -> this does not give any error, but the ones below do

my_list = ijson.items(fd,'').next()
-> error = '_yajl2.items' object has no attribute 'next'

my_list = ijson.items(fd,'').items()
-> error = '_yajl2.items' object has no attribute 'items'

my_list = ij['categoryAspects']
-> error = '_yajl2.items' object is not subscriptable


Solution

  • This should work:

    import ijson

    # ijson expects a file opened in binary mode ('rb'); 'b' alone is not a valid mode
    with open('your_file.json', 'rb') as f:
        for n in ijson.items(f, 'categoryAspects.item'):
            print(n)
    

    Additionally, if you know that your numbers fit comfortably in a regular float, you can pass use_float=True as an extra argument to items for extra speed (ijson.items(f, 'categoryAspects.item', use_float=True) in the code above). By default ijson returns non-integer numbers as Decimal objects, and use_float=True makes it return float instead. There are more details about this in the documentation.

    EDIT: Answering a further question: to simply get a list with all the numbers you can create one directly from the items function like so:

    # open in binary mode ('rb'), as above
    with open('your_file.json', 'rb') as f:
        numbers = list(ijson.items(f, 'categoryAspects.item'))
    

    Note that if there are too many numbers you might still run out of memory, which defeats the purpose of streaming parsing.

    EDIT2: An alternative to using a list is to create a numpy array with all the numbers, which should give a more compact representation in memory of all the numbers at once, in case they are needed:

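    import ijson
    import numpy

    # open in binary mode ('rb'); fromiter consumes the ijson iterator item by item
    with open('your_file.json', 'rb') as f:
        numbers = numpy.fromiter(
                    ijson.items(f, 'categoryAspects.item', use_float=True),
                    dtype='float'  # or 'int', if these are integers
                  )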