pythonnesteddictionary-comprehensionset-comprehension

Python Set Comprehension Nested in Dict Comprehension


I have a list of tuples, where each tuple contains a string and a number in the form of:

[(string_1, num_a), (string_2, num_b), ...]

The strings are nonunique, and so are the numbers, e.g. (string_1 , num_m) or (string_9 , num_b) are likely to exist in the list.

I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:

dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}

I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:

#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...} 
st_dict = {
    st[0]: set(
        st_[1]
        for st_ in st_id_list
        if st_[0] == st[0]
    )
    for st in st_id_list
}

There's only one issue: st_id_list is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.

Is there a way to avoid this, or a smarter way to it?


Solution

  • You have a double loop, so you take O(N**2) time to produce your dictionary. For 500 items, 250.000 steps are taken, and for your 18k items, 324 million steps need to be done.

    Here is a O(N) loop instead, so 500 steps for your smaller dataset, 18.000 steps for the larger dataset:

    st_dict = {}
    for st, id in st_id_list:
        st_dict.setdefault(st, set()).add(id)
    

    This uses the dict.setdefault() method to ensure that for a given key (your string values), there is at least an empty set available if the key is missing, then adds the current id value to that set.

    You can do the same with a collections.defaultdict() object:

    from collections import defaultdict
    
    st_dict = defaultdict(set)
    for st, id in st_id_list:
        st_dict[st].add(id)
    

    The defaultdict() uses the factory passed in to set a default value for missing keys.

    The disadvantage of the defaultdict approach is that the object continues to produce default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None to disable the factory explicitly to prevent that.