I have thousands of very large JSON files that I need to process for specific elements. To avoid memory overload I am using a Python library called ijson, which works fine when I process only a single element from the JSON file, but when I try to process multiple elements at once it throws:
IncompleteJSONError: parse error: premature EOF
Partial JSON:
{
"info": {
"added": 1631536344.112968,
"started": 1631537322.81162,
"duration": 14,
"ended": 1631537337.342377
},
"network": {
"domains": [
{
"ip": "231.90.255.25",
"domain": "dns.msfcsi.com"
},
{
"ip": "12.23.25.44",
"domain": "teo.microsoft.com"
},
{
"ip": "87.101.90.42",
"domain": "www.msf.com"
}
]
}
}
Working Code: (Multiple file open)
    import glob
    import ijson

    my_file_list = [f for f in glob.glob("data/jsons/*.json")]
    final_result = []
    for filename in my_file_list:
        row = {}
        with open(filename, 'r') as f:
            info = ijson.items(f, 'info')
            for o in info:
                row['added'] = float(o.get('added'))
                row['started'] = float(o.get('started'))
                row['duration'] = o.get('duration')
                row['ended'] = float(o.get('ended'))
        with open(filename, 'r') as f:
            domains = ijson.items(f, 'network.domains.item')
            domain_count = 0
            for domain in domains:
                domain_count += 1
            row['domain_count'] = domain_count
Failing Code: (Single file open)
    import glob
    import ijson

    my_file_list = [f for f in glob.glob("data/jsons/*.json")]
    final_result = []
    for filename in my_file_list:
        row = {}
        with open(filename, 'r') as f:
            info = ijson.items(f, 'info')
            for o in info:
                row['added'] = float(o.get('added'))
                row['started'] = float(o.get('started'))
                row['duration'] = o.get('duration')
                row['ended'] = float(o.get('ended'))
            domains = ijson.items(f, 'network.domains.item')
            domain_count = 0
            for domain in domains:
                domain_count += 1
            row['domain_count'] = domain_count
I am not sure whether the cause is the one described in "Using python ijson to read a large json file with multiple json objects": that ijson is not able to work on multiple JSON elements at once.
Also, please let me know of any other Python package, or a sample example, that can handle large JSON files without memory issues.
I think this is happening because you've finished reading the IO stream from the file: you're already at the end, and you're asking for another query.
What you can do is to reset the cursor to the 0 position before the second query:
f.seek(0)
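For example, here is a minimal sketch of the single-open version with the rewind added. The sample document and the temporary file are just for illustration, so the snippet runs on its own; in your case you would keep iterating over your glob list instead:

    import ijson
    import json
    import tempfile

    # Sample document with the same shape as the JSON in the question.
    doc = {
        "info": {"added": 1631536344.112968, "started": 1631537322.81162,
                 "duration": 14, "ended": 1631537337.342377},
        "network": {"domains": [
            {"ip": "231.90.255.25", "domain": "dns.msfcsi.com"},
            {"ip": "12.23.25.44", "domain": "teo.microsoft.com"},
            {"ip": "87.101.90.42", "domain": "www.msf.com"},
        ]},
    }

    with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
        json.dump(doc, tmp)
        filename = tmp.name

    row = {}
    with open(filename, 'r') as f:
        for o in ijson.items(f, 'info'):
            row['added'] = float(o.get('added'))
            row['started'] = float(o.get('started'))
            row['duration'] = o.get('duration')
            row['ended'] = float(o.get('ended'))
        f.seek(0)  # rewind: ijson consumed the stream, so restart before the next query
        row['domain_count'] = sum(1 for _ in ijson.items(f, 'network.domains.item'))
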
In a comment I said that you should try json-stream as well, but this is not an ijson or json-stream bug; it's a TextIO feature. Seeking back to position 0 is the equivalent of opening the file a second time.
If you don't want to do this, then maybe you should look at iterating through every portion of the JSON and deciding, for each object, whether it belongs to info or network.domains.item.
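A sketch of that single-pass approach, using ijson's low-level parse() event stream and dispatching on each event's prefix (again with a throwaway sample file so the snippet is self-contained):

    import ijson
    import json
    import tempfile

    # Minimal sample with the same shape as the question's JSON.
    doc = {
        "info": {"added": 1631536344.112968, "started": 1631537322.81162,
                 "duration": 14, "ended": 1631537337.342377},
        "network": {"domains": [
            {"ip": "231.90.255.25", "domain": "dns.msfcsi.com"},
            {"ip": "12.23.25.44", "domain": "teo.microsoft.com"},
            {"ip": "87.101.90.42", "domain": "www.msf.com"},
        ]},
    }

    with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
        json.dump(doc, tmp)
        filename = tmp.name

    row = {}
    domain_count = 0
    with open(filename, 'r') as f:
        # One pass over the whole file: every (prefix, event, value) triple
        # tells us where we are in the document.
        for prefix, event, value in ijson.parse(f):
            if prefix.startswith('info.') and event == 'number':
                key = prefix.split('.', 1)[1]
                row[key] = float(value)
            elif prefix == 'network.domains.item' and event == 'start_map':
                domain_count += 1  # one map per domain entry
    row['domain_count'] = domain_count

This reads the file exactly once, no matter how many different elements you pull out of it.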