So I have a dictionary that I'm getting as a hash object from Redis, similar to the following:
source_data = {
    b'key-1': b'{"age":33,"gender":"Male"}',
    b'key-2': b'{"age":20,"gender":"Female"}'
}
My goal is to extract all the values from this dictionary and have them as a list of Python dictionaries, like so:
final_data = [
    {
        'age': 33,
        'gender': 'Male'
    },
    {
        'age': 20,
        'gender': 'Female'
    }
]
I tried a list comprehension with JSON parsing:
import json
final_data = [json.loads(a) for a in source_data.values()]
It works, but for a large data set it takes too much time.
I switched to the third-party JSON module ujson, which is faster according to this benchmark, but I haven't noticed any improvement.
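For reference, the ujson variant is just a drop-in replacement for json.loads in the same comprehension (this snippet assumes ujson is installed, e.g. with pip install ujson):

import ujson

final_data = [ujson.loads(value) for value in source_data.values()]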
I tried using multi-threading:

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
import ujson

pool = Pool()
final_data = pool.map(ujson.loads, source_data.values(), chunksize=500)
pool.close()
pool.join()
I played a bit with chunksize, but the result is the same: it still takes too much time.
It would be super helpful if someone could suggest another solution or an improvement on my previous attempts; ideally I would like to avoid using a loop.
Assuming the values are, indeed, valid JSON, it might be faster to build a single JSON array to decode. I think it should be safe to just join the values into a single string.
>>> new_json = b'[%s]' % (b','.join(source_data.values()),)
>>> new_json
b'[{"age":33,"gender":"Male"},{"age":20,"gender":"Female"}]'
>>> json.loads(new_json)
[{'age': 33, 'gender': 'Male'}, {'age': 20, 'gender': 'Female'}]
This replaces the overhead of calling json.loads 2000+ times with the lesser overhead of a single call to b','.join and a single string-formatting operation.
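If you want to verify the difference on data shaped like yours, a rough timeit comparison along these lines should do it (the 2000-entry sample dict below is made up purely for illustration; substitute your real data):

import json
import timeit

# Hypothetical sample data: 2000 small JSON values shaped like the Redis hash
# described in the question.
source_data = {
    f'key-{i}'.encode(): b'{"age":33,"gender":"Male"}'
    for i in range(2000)
}

def parse_individually():
    # One json.loads call per value.
    return [json.loads(v) for v in source_data.values()]

def parse_joined():
    # Join all values into one JSON array and decode it with a single call.
    return json.loads(b'[%s]' % (b','.join(source_data.values()),))

# Both approaches should produce identical results.
assert parse_individually() == parse_joined()

print('individual:', timeit.timeit(parse_individually, number=100))
print('joined    :', timeit.timeit(parse_joined, number=100))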