[SOLVED] Exception when trying to parse large JSON file using ijson

I am trying to parse a large JSON file (16 GB) using ijson, but I always get the following error:

Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
          venue" : {          "type" : NumberInt(0)      },       "yea
                     (right here) ------^
  File "C:\pyth\dblp_parser.py", line 14, in <module>
    for record in ijson.items(f, 'item', use_float=True):

My code is as follows:

with open("dblpv13.json", "rb") as f:
    for record in ijson.items(f, 'records.item', use_float=True):
        paper_id = record["_id"] #_id is only for test
        paper_id_tab.append(paper_id)

A part of my json file is as follows:

{
    "_id" : "53e99784b7602d9701f3f636",
    "title" : "Flatlined",
    "authors" : [
        {
            "_id" : "53f58b15dabfaece00f8046d",
            "name" : "Peter J. Denning",
            "org" : "ACM Education Board",
            "gid" : "5b86c72de1cd8e14a3c2b772",
            "oid" : "544bd99545ce266baef0668a",
            "orgid" : "5f71b2811c455f439fe3c58a"
        }
    ],
    "venue" : {
        "_id" : "555036f57cea80f954169e28",
        "raw" : "Commun. ACM",
        "raw_zh" : null,
        "publisher" : null,
        "type" : NumberInt(0)
    },
    "year" : NumberInt(2002),
    "keywords" : [
        "linear scale",
        "false dichotomy"
    ],
    "n_citation" : NumberInt(7),
    "page_start" : "15",
    "page_end" : "19",
    "lang" : "en",
    "volume" : "45",
    "issue" : "6",
    "issn" : "",
    "isbn" : "",
    "doi" : "10.1145/508448.508463",
    "pdf" : "",
    "url" : [
        "http://doi.acm.org/10.1145/508448.508463"
    ],
    "abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},

I also tried reading the records item by item, but I always get the same error. I'm completely stuck. Can anybody help me, please?

Solution

I ran into the same problem with this dataset. ijson can't handle it because NumberInt(...) is MongoDB shell syntax, not valid JSON, so the parser stops at the first occurrence. I worked around it by writing a cleaned copy of the dataset and parsing that copy with ijson. The approach is simple: read the original dataset line by line, remove "NumberInt(" and ")", and write the result to a new JSON file. The code is given below.

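# Open the cleaned copy for writing ('w'); the default mode is read-only and f.write() would fail.
# encoding='utf-8' assumes the dump is UTF-8; errors='ignore' drops any undecodable bytes.
with open('dblpv13.json', 'r', encoding='utf-8', errors='ignore') as myFile, \
     open('dblpv13_clean.json', 'w', encoding='utf-8') as f:
  for line in myFile:
    # Note: the second replace strips every ")" on the line, not only the one closing NumberInt(.
    line = line.replace("NumberInt(", "").replace(")", "")
    f.write(line)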
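
One caveat with removing every ")" is that it also strips closing parentheses that legitimately appear inside titles or abstracts. A narrower alternative, given here only as a minimal sketch, is a regular expression that matches just NumberInt(<integer>) and keeps the integer. The helper name strip_number_int and the sample record are made up for illustration; they are not part of the dataset or of ijson.

```python
import json
import re

# Matches the MongoDB-shell wrapper NumberInt(<integer>) and captures the integer.
NUMBER_INT = re.compile(r'NumberInt\((-?\d+)\)')

def strip_number_int(line: str) -> str:
    """Replace NumberInt(n) with the bare integer n; leave all other text alone."""
    return NUMBER_INT.sub(r'\1', line)

# Hypothetical one-line record: the title contains parentheses that must survive.
sample = '{"year" : NumberInt(2002), "title" : "Flatlined (reprint)"}'
cleaned = strip_number_int(sample)

# The cleaned text is now plain JSON, so the standard parser accepts it.
record = json.loads(cleaned)
print(record)
```

In the cleaning loop, line = strip_number_int(line) could stand in for the two chained replace() calls. A string value that itself contains the literal text NumberInt(5) would still be rewritten, so this is a heuristic rather than a real parser.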

Now you can parse the new dataset with ijson as follows.

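import ijson

# ijson works best on a binary file object, so open the cleaned file with "rb".
with open('dblpv13_clean.json', "rb") as f:
  for i, element in enumerate(ijson.items(f, "item")):
    pass  # do something with each record, e.g. read element["_id"]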