python, python-3.x, generator, ijson

ijson kvitems unexpected behaviour


I'm using ijson to parse through large JSONs. I have this code, which should give me a dict of values corresponding to the relevant JSON fields:

def parse_kvitems(kv_gen, key_list):
    results = {}
    for key in key_list:
        results[key] = (v for k, v in kv_gen if k == key)
    return results

with zipfile.ZipFile(fr'{directory}\{file}', 'r') as zipObj:
    # Get a list of all archived file names from the zip
    listOfFileNames = zipObj.namelist()
    # Iterate over the file names
    for fileName in listOfFileNames:
        # Only process .json files. Don't extract: ijson wants bytes input, and json.loads can run into memory issues with smash JSONs.
        if fileName.endswith('.json'):
            # Extract a single file from zip
            with zipObj.open(fileName) as f:

                #HERE:
                records = ijson.kvitems(f, 'records.item')
                data_list = ['id', 'features', 'modules', 'dbxrefs', 'description']    
                parsed_records = parse_kvitems(records, data_list)  # should give me a dict of the values that fall under the JSON headings in data_list

I think the kvitems object is acting like a generator and only making it through one run-through (I get the expected values for 'id', but the other data_list keys in parsed_records are empty).

To get around this I tried to make a list of duplicate kv_gen's:

def parse_kvitems(kv_gen, key_list):
    kv_list = [kv_gen] * len(key_list) #this bit
    results = {}
    for key, kv_gen in zip(key_list, kv_list):
        results[key] = (v for k, v in kv_gen if k == key)
    return results

This gave me the same result. I think mutability may be a culprit here, but I can't use copy on the kvitems object to see if this fixes it.

I then tried to use itertools.cycle(), but this seems to be working in a way I don't understand:

def parse_kvitems(kv_gen, key_list):
    infinite_kvitems = itertools.cycle(kv_gen)
    results = {}
    for key in key_list:
        results[key] = (v for k, v in infinite_kvitems if k == key)
    return results

Also, the below works (in the sense it gives me values that match what I see when I load a JSON with json.load()):

records = ijson.kvitems(f, 'records.item')
ids = (v for k, v in records if k == 'id')
features = (v for k, v in records if k == 'features')
modules = (v for k, v in records if k == 'modules')

I'm just interested in why my function doesn't, especially when the records object is being run through multiple times above...


Edit for Rodrigo

You are not showing, however, how you find that your final dictionary has values for id but not for the other keys. I'm assuming it's only because you are iterating over the values under parsed_records['id'] first. As you do so, the generator expression is then evaluated and the underlying kvitems generator is exhausted.

Yup, this is correct - I was converting each val to a list to check each key had a generator containing the same number of items, as I was worried a downstream zip operation might truncate some values if they had more objects than the smallest generator.

I didn't convert to a list in the function as I thought a generator would be a better object to return (less memory intensive etc.), which I could then convert to a list if I needed to outside the function.

You say that your final piece of code works as expected. This is the only bit that surprises me, especially if you really, really inspected (i.e., evaluated) all three of the generator expressions after you created them. If you could clarify if that's the case it would be interesting; otherwise, if you created all three generator expressions but then evaluated only one or the other, then there are no surprises here (because of the "About result collection" explanation).

Basically, it gave me the values I was expecting when I ran through the items as a zipped collection of generators and appended the items to a list. But this might need some more investigation, the JSONs are quite convoluted so I might have missed something.


Solution

    About result collection

    Beware of how you are collecting the results from kvitems. In all your examples above you are using generator expressions, which are themselves lazily evaluated, and this may lead to misunderstandings. You are not showing, however, how you find that your final dictionary has values for id but not for the other keys. I'm assuming it's only because you are iterating over the values under parsed_records['id'] first. As you do so, that generator expression is evaluated and the underlying kvitems generator is exhausted. When you then iterate over the other generator expressions, the kvitems generator that feeds them has nothing left, so they yield nothing. If you were to iterate over the values for one of the other keys first, you should see values for that key and not for the others.

    Generator expressions themselves are great, but in this case they might end up adding confusion. If you want to avoid this situation you may want to consolidate those sequences into lists instead (e.g., using [... for k, v in kvitems ...] instead of (... for k, v in kvitems ...)).
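To make the exhaustion visible without needing a real JSON file, here is a minimal sketch. The fake_kvitems function is a hypothetical stand-in for ijson.kvitems (it is not part of ijson); the only property it shares with the real thing is that it yields (key, value) pairs and can be iterated exactly once.

```python
def fake_kvitems():
    # Hypothetical stand-in for ijson.kvitems: yields (key, value) pairs once.
    yield from [("id", 1), ("features", "a"), ("id", 2), ("features", "b")]


pairs = fake_kvitems()

# Creating the generator expressions consumes nothing yet;
# both of them pull from the same underlying iterator.
ids = (v for k, v in pairs if k == "id")
features = (v for k, v in pairs if k == "features")

ids_list = list(ids)            # drains `pairs` completely
features_list = list(features)  # `pairs` is already exhausted

print(ids_list)       # [1, 2]
print(features_list)  # []
```

Swapping the order of the two list() calls would give values for features and an empty list for ids, which matches the behaviour described above.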

    About kvitems

    As you point out, kvitems is a single-pass generator (or a single-pass asynchronous generator when fed with an asynchronous file-like object), so once you fully iterate over it, further iterations yield no values. This is why your original code gives you values for id but not for the other keys: those are collected on subsequent iterations over an already-exhausted kvitems object.

    Trying to duplicate the kvitems object with [kv_gen] * len(key_list) doesn't work either: as you also found out, you are simply creating a list with the same object in all positions instead of copies of the original object.
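A quick way to convince yourself of this, again with a hypothetical fake_kvitems stand-in rather than a real ijson object:

```python
def fake_kvitems():
    # Hypothetical stand-in for ijson.kvitems.
    yield ("id", 1)
    yield ("features", "a")


gen = fake_kvitems()
copies = [gen] * 3  # three references to one generator, not three generators

all_same = all(c is gen for c in copies)

first = next(copies[0])   # advances the one shared generator
second = next(copies[1])  # continues where the previous call stopped

print(all_same)  # True
print(first)     # ('id', 1)
print(second)    # ('features', 'a')
```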

    Copying the kvitems object is simply not possible. The only option to get N "copies" is to actually construct N different objects; this means, however, that the input file will be read N times (and needs to be opened N times as well, as kvitems will advance the given file until it doesn't have any more input). Possible, but not great.
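For completeness, the "N different objects" route could look roughly like the sketch below. Both parse_by_rereading and make_pairs are hypothetical names; in your code make_pairs would have to reopen the zip member and call ijson.kvitems on it each time, whereas here it just returns a fresh iterator over a small list so the example runs on its own.

```python
def parse_by_rereading(make_pairs, keys):
    # `make_pairs` is a hypothetical zero-argument callable that returns a
    # brand-new iterator of (key, value) pairs every time it is called.
    results = {}
    for key in keys:
        # One fresh iterator per key: the input is read once for every key.
        results[key] = [v for k, v in make_pairs() if k == key]
    return results


def make_pairs():
    # Stand-in for "reopen the file and build a new kvitems object".
    return iter([("id", 1), ("features", "a"), ("id", 2)])


reread = parse_by_rereading(make_pairs, ["id", "features"])
print(reread)  # {'id': [1, 2], 'features': ['a']}
```

The list comprehension is evaluated immediately inside the loop, so each key gets its values before the next iterator is created. The cost is one full pass over the input per key.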

    The result of itertools.cycle is an infinite iterator. You then use it as the basis for constructing different generator expressions (so, lazily evaluated). You mention that this solution worked in a way you don't understand, but you don't elaborate on what exactly happened. My expectation is that when trying to inspect the values for any of the keys, you run into an infinite loop because your generator expression is iterating over an infinite iterator, or something similar.

    You say that your final piece of code works as expected. This is the only bit that surprises me, especially if you really, really inspected (i.e., evaluated) all three of the generator expressions after you created them. If you could clarify if that's the case it would be interesting; otherwise, if you created all three generator expressions but then evaluated only one or the other, then there are no surprises here (because of the "About result collection" explanation).
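What I'd expect to happen with itertools.cycle can be sketched like this, once more with a hypothetical fake_kvitems stand-in. itertools.islice is used only to take a bounded sample, since fully consuming the generator expression would never finish.

```python
import itertools


def fake_kvitems():
    # Hypothetical stand-in for ijson.kvitems.
    yield ("id", 1)
    yield ("features", "a")


endless = itertools.cycle(fake_kvitems())
ids = (v for k, v in endless if k == "id")

# list(ids) would never return, because `endless` repeats the pairs forever.
sample = list(itertools.islice(ids, 5))
print(sample)  # [1, 1, 1, 1, 1]
```

Note also that itertools.cycle keeps a copy of every item it has seen so it can replay them, so for a large JSON this would give up the memory savings you are using ijson for.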

    How to tackle your problem

    It basically all boils down to doing a single iteration over kvitems. You could try for instance something like this:

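    import collections

    def parse_kvitems(kvitems, keys):
        # Walk the single-pass kvitems stream once, collecting values for every wanted key.
        results = collections.defaultdict(list)
        for k, v in kvitems:
            if k in keys:
                results[k].append(v)
        return results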
    

    That should do it, I think.
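As a usage sketch, and since you mentioned worrying about a downstream zip truncating values, you can compare the lengths of the collected lists right after the single pass. The input here is a hand-written iterator of (key, value) pairs standing in for ijson.kvitems(f, 'records.item'); the sample values are made up for illustration.

```python
import collections


def parse_kvitems(kvitems, keys):
    wanted = set(keys)  # set membership keeps the per-item check cheap
    results = collections.defaultdict(list)
    for k, v in kvitems:  # the one and only pass over the stream
        if k in wanted:
            results[k].append(v)
    return results


# Stand-in for the kvitems stream; the second record has no "description".
pairs = iter([
    ("id", "r1"), ("description", "first"), ("features", ["f1"]),
    ("id", "r2"), ("features", ["f2"]),
])

parsed = parse_kvitems(pairs, ["id", "features", "description"])
lengths = {key: len(values) for key, values in parsed.items()}
print(lengths)  # {'id': 2, 'description': 1, 'features': 2}
```

If the lengths differ, some records are missing a key, and zipping the lists would silently misalign or truncate them.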