python json large-data ijson

Trying to stream my (very large) JSON file with ijson - is it formatted wrong?


I'm trying to stream through a large JSON file using ijson in Python. This is my first time trying this.

My code is really simple right now:

import ijson

with open('file.json', 'rb') as f:
    j = ijson.items(f, 'item')

    for item in j:
        print('x')

This raises a "trailing garbage" error. Essentially, the second item in the file is considered garbage, I think because of the file format.

My JSON file is this one from Kaggle, and it is formatted like this:

{"_id":{"$oid":"6457879fd1187d621cbbba9c"},"sourceCC":"us",...etc...}
{"_id":{"$oid":"6457879fd1187d621cbddd8a"},"sourceCC":"us",...etc...}

It is about 3 GB in size, so I'm unable to open it.

If I use 'multiple_values=True', I believe it treats all the items as multiple values for the same item, so it does not raise an error, but it also does not return anything.

What can I do?

Thanks.


Solution

  • That's not actually a single JSON document. It is a series of JSON documents separated by newlines (a format often called JSON Lines or NDJSON). You don't need ijson to read it; you can instead read it line by line and use the built-in json module:

    import json
    
    with open('myfile.json') as fd:
      for line in fd:
        obj = json.loads(line)
        # do something with obj here
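
If the file might contain blank lines or an occasional malformed record, the same line-by-line idea can be wrapped in a generator. This is only a minimal sketch: the function name iter_json_lines is made up for illustration, UTF-8 encoding is an assumption, and the field names in the usage comment come from the sample shown in the question.

```python
import json


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a newline-delimited JSON file."""
    # Assumption: the file is UTF-8 text with one JSON document per line.
    with open(path, encoding='utf-8') as fd:
        for line_number, line in enumerate(fd, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. an empty line at the end
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line failed so a bad record is easy to locate.
                raise ValueError(
                    f'line {line_number} is not valid JSON: {exc}'
                ) from exc


# Hypothetical usage, with field names taken from the sample in the question:
# for obj in iter_json_lines('myfile.json'):
#     print(obj['_id']['$oid'], obj['sourceCC'])
```

Because only one line is held in memory at a time, this should cope with a 3 GB file. If you would rather stay with ijson, passing an empty prefix together with multiple_values=True, i.e. ijson.items(f, '', multiple_values=True), should yield each top-level object in turn; the 'item' prefix only matches elements of a top-level array, which this file does not have.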