python json ijson

Exception when trying to parse large JSON file using ijson


I am trying to parse a large JSON file (16GB) using ijson, but I always get the following error:

Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
          venue" : {          "type" : NumberInt(0)      },       "yea
                     (right here) ------^
  File "C:\pyth\dblp_parser.py", line 14, in <module>
    for record in ijson.items(f, 'item', use_float=True):

My code is as follows:

import ijson

paper_id_tab = []  # collects the _id of every record

with open("dblpv13.json", "rb") as f:
    for record in ijson.items(f, 'records.item', use_float=True):
        paper_id = record["_id"]  # _id is only for test
        paper_id_tab.append(paper_id)

Part of my JSON file looks like this:

{
    "_id" : "53e99784b7602d9701f3f636",
    "title" : "Flatlined",
    "authors" : [
        {
            "_id" : "53f58b15dabfaece00f8046d",
            "name" : "Peter J. Denning",
            "org" : "ACM Education Board",
            "gid" : "5b86c72de1cd8e14a3c2b772",
            "oid" : "544bd99545ce266baef0668a",
            "orgid" : "5f71b2811c455f439fe3c58a"
        }
    ],
    "venue" : {
        "_id" : "555036f57cea80f954169e28",
        "raw" : "Commun. ACM",
        "raw_zh" : null,
        "publisher" : null,
        "type" : NumberInt(0)
    },
    "year" : NumberInt(2002),
    "keywords" : [
        "linear scale",
        "false dichotomy"
    ],
    "n_citation" : NumberInt(7),
    "page_start" : "15",
    "page_end" : "19",
    "lang" : "en",
    "volume" : "45",
    "issue" : "6",
    "issn" : "",
    "isbn" : "",
    "doi" : "10.1145/508448.508463",
    "pdf" : "",
    "url" : [
        "http://doi.acm.org/10.1145/508448.508463"
    ],
    "abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},

I tried to fill in the records item by item, but I always get the same error. I'm completely blocked. Can anybody help me, please?


Solution

  • The same problem happened to me with this dataset. ijson can't handle it because NumberInt(...) is MongoDB shell syntax, not valid JSON. I overcame the problem by creating a cleaned copy of the dataset and then parsing that copy with ijson. The approach is quite simple: read the original dataset line by line, strip the "NumberInt(" ... ")" wrappers so that only the bare numbers remain, and write the result to a new JSON file. The code is given below.

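    import re

    # NumberInt(2002) -> 2002; other parentheses (e.g. in abstracts) are left untouched
    number_int = re.compile(r'NumberInt\((-?\d+)\)')

    # encoding='utf-8' is an assumption about the dataset; adjust it if your copy differs
    with open('dblpv13.json', 'r', encoding='utf-8', errors='ignore') as src, \
         open('dblpv13_clean.json', 'w', encoding='utf-8') as dst:
      for line in src:
        dst.write(number_int.sub(r'\1', line))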
    

    Now you can parse the new dataset with ijson as follows.

    import ijson

    with open('dblpv13_clean.json', "r",errors='ignore') as f:
      for i, element in enumerate(ijson.items(f, "item")):
        print(i, element["_id"])  # replace with your own processing
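Before running the clean-up over the whole 16GB file, it may be worth checking the substitution on a small sample. The following is a minimal sketch using only the standard library. The helper name `strip_number_int` and the sample string are hypothetical, made up for illustration, and the pattern assumes that every `NumberInt(...)` wraps a plain integer and sits on a single line, as in the excerpt from the question.

```python
import json
import re

# Matches a MongoDB shell wrapper such as NumberInt(2002).
# The capture group keeps only the (optionally negative) integer inside.
NUMBER_INT = re.compile(r'NumberInt\((-?\d+)\)')


def strip_number_int(line: str) -> str:
    """Replace every NumberInt(n) in the line with the bare integer n."""
    return NUMBER_INT.sub(r'\1', line)


# Hypothetical one-line record, not taken from the dataset.
sample = '{"year" : NumberInt(2002), "title" : "Flatlined (reprint)"}'

cleaned = strip_number_int(sample)
record = json.loads(cleaned)  # parses without error once the wrapper is gone
print(record)
```

Because the pattern only matches the full `NumberInt(<digits>)` form, parentheses that are part of ordinary text, such as the ones in the sample title, survive the clean-up. A string value that happens to contain the literal text `NumberInt(5)` would still be rewritten, so this remains a heuristic rather than a real parser.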