Tags: python, twitter, error-handling, bz2, ijson

Python ijson - parse error: trailing garbage // bz2.decompress()


I have come across an error while parsing JSON with ijson.

Background: I have a series (approx. 1,000) of large files of Twitter data compressed in '.bz2' format. I need to get elements from each file into a pd.DataFrame for further analysis. I have identified the keys I need to extract. I am cautious about putting Twitter data up.

Attempt: I have managed to decompress the files using bz2.decompress with the following code:

## Code in loop specific for decompressing and parsing -
import bz2
import ijson

with open(file, 'rb') as source:
    # Decompresses one file at a time, all at once, rather than as a stream
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8')

    # Parse the JSON with ijson
    parser = ijson.parse(json_decom)
    for prefix, event, value in parser:
        # Print selected items as part of testing
        if prefix == "created_at":
            print(value)
        if prefix == "text":
            print(value)
        if prefix == "user.id_str":
            print(value)

This gives the following error:

IncompleteJSONError: parse error: trailing garbage
          estamp_ms":"1609466366680"}  {"created_at":"Fri Jan 01 01:59
                     (right here) ------^
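For reference, this is the failure mode you get when the input holds more than one top-level JSON document back to back, which is how Twitter stream dumps are typically written (one object after another). A minimal stdlib reproduction of the same class of error, using json rather than ijson purely for illustration (the sample payload is made up):

```python
import json

# Two top-level JSON documents back to back, as in a Twitter stream dump
payload = '{"created_at": "Fri Jan 01 01:59:00 +0000 2021"} {"created_at": "Fri Jan 01 02:00:00 +0000 2021"}'

try:
    json.loads(payload)
except json.JSONDecodeError as exc:
    # The stdlib parser calls this "Extra data";
    # ijson reports the same situation as "trailing garbage"
    print(exc.msg)  # Extra data
```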

Two things:

Any assistance would be greatly appreciated.

Thank you, James


Solution

  • To directly answer your two questions:

    About the code as a whole: while it's working correctly, it could be improved on. The whole point of using ijson is that you avoid loading the full JSON contents into memory, but the code you posted doesn't take advantage of this: it opens the bz2-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and only then hands the decoded data to ijson. If your input file is small and the decompressed data is also small you won't see any impact, but if your files are big then you'll definitely start noticing it.

    A better approach is to stream the data through all the operations so that everything happens incrementally: decompression, no explicit decoding (ijson accepts bytes), and JSON parsing. Something along the lines of:

    with bz2.BZ2File(filename, mode='r') as f:
        for prefix, event, value in ijson.parse(f):
            # ...
    
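    Since the files hold many tweets back to back, ijson also needs multiple_values=True so that it accepts more than one top-level document instead of raising the trailing-garbage error. A self-contained sketch of the decompression side of this pipeline (stdlib only; the sample data below is made up, standing in for one of the real '.bz2' files):

    ```python
    import bz2
    import io

    # Hypothetical sample: two newline-delimited JSON documents,
    # bz2-compressed in memory to stand in for a file on disk
    raw = (b'{"created_at": "Fri Jan 01 01:59:00 +0000 2021"}\n'
           b'{"created_at": "Fri Jan 01 02:00:00 +0000 2021"}\n')
    compressed = bz2.compress(raw)

    # BZ2File decompresses incrementally as the file object is read,
    # so the whole payload never has to sit in memory at once
    with bz2.BZ2File(io.BytesIO(compressed), mode='r') as f:
        for line in f:
            print(line)
    ```

    The same file object can be passed straight to ijson.parse(f, multiple_values=True), which then parses the concatenated documents one by one.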

    As the cherry on the cake, if you want to build a DataFrame from the results you can use the DataFrame's data argument to build it directly from the above. data can be an iterable, so you can, for example, turn the code above into a generator and pass it as data. Again, something along the lines of:

    def json_input():
        with bz2.BZ2File(filename, mode='r') as f:
            for prefix, event, value in ijson.parse(f):
                # yield your results
                ...

    df = pandas.DataFrame(data=json_input())
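    A fuller sketch of that pattern, with a plain list of dicts standing in for the ijson event stream (the field names are assumptions based on the keys printed earlier; in the real code each row dict would be assembled from the (prefix, event, value) triples as they stream past):

    ```python
    import pandas as pd

    def tweet_rows():
        # Stand-in for the streaming ijson loop above
        sample = [
            {"created_at": "Fri Jan 01 01:59:00 +0000 2021",
             "text": "hello", "user.id_str": "123"},
            {"created_at": "Fri Jan 01 02:00:00 +0000 2021",
             "text": "world", "user.id_str": "456"},
        ]
        for row in sample:
            yield row

    # DataFrame accepts any iterable of row dicts, including a generator,
    # so rows are consumed as they are produced
    df = pd.DataFrame(data=tweet_rows())
    print(df.shape)  # (2, 3)
    ```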