I have come across an error while parsing JSON with ijson.
Background:
I have a series (approx. 1000) of large files of Twitter data compressed in '.bz2' format. I need to get elements from each file into a pd.DataFrame
for further analysis. I have identified the keys I need to extract. I am cautious about putting Twitter data up.
Attempt:
I have managed to decompress the files using bz2.decompress with the following code:
## Code in loop specific for decompressing and parsing -
import bz2
import ijson

with open(file, 'rb') as source:
    # Decompress the file (one file at a time rather than a stream)
    json_r = bz2.decompress(source.read())
    json_decom = json_r.decode('utf-8')
    # Parse the JSON with ijson
    parser = ijson.parse(json_decom)
    for prefix, event, value in parser:
        # Print selected items as part of testing
        if prefix == "created_at":
            print(value)
        if prefix == "text":
            print(value)
        if prefix == "user.id_str":
            print(value)
This gives the following error:
IncompleteJSONError: parse error: trailing garbage
estamp_ms":"1609466366680"} {"created_at":"Fri Jan 01 01:59
(right here) ------^
Two things: is my decompression approach appropriate here, and how do I resolve this error?
Any assistance would be greatly appreciated.
Thank you, James
To directly answer your two questions:

1. The decompression method is correct in the sense that it yields JSON data that you then feed to ijson. As you point out, ijson works with both str and bytes inputs (although the latter is preferred); if you were giving ijson some non-JSON input you wouldn't see an error showing JSON data in it.
2. This is a very common error that is described in ijson's FAQ. It basically means your JSON document has more than one top-level value, which is not standard JSON, but is supported by ijson via the multiple_values option (see the docs for details).
About the code as a whole: while it works correctly, it could be improved on: the whole point of using ijson is that you can avoid loading the full JSON contents into memory. The code you posted doesn't use this to its advantage though: it first opens the bz2-compressed file, reads it as a whole, decompresses that as a whole, (unnecessarily) decodes that as a whole, and only then gives the decoded data as input to ijson. If your input file is small, and the decompressed data is also small, you won't see any impact, but if your files are big then you'll definitely start noticing it.
A better approach is to stream the data through all the operations so that everything happens incrementally: streaming decompression, no intermediate decoding, and incremental JSON parsing. Something along the lines of:
with bz2.BZ2File(filename, mode='r') as f:
    # multiple_values=True is needed because each file contains several
    # top-level JSON values back to back
    for prefix, event, value in ijson.parse(f, multiple_values=True):
        # ...
As the cherry on the cake, if you want to build a DataFrame from the results, you can use DataFrame's data argument to construct the DataFrame directly from the output of the above. data can be an iterable, so you can, for example, turn the code above into a generator and pass that as data. Again, something along the lines of:
def json_input():
    with bz2.BZ2File(filename, mode='r') as f:
        for prefix, event, value in ijson.parse(f, multiple_values=True):
            # yield your results

df = pandas.DataFrame(data=json_input())