I have a list of tuples, where each tuple
contains a string
and a number in the form of:
[(string_1, num_a), (string_2, num_b), ...]
The strings are nonunique, and so are the numbers, e.g. (string_1 , num_m)
or (string_9 , num_b)
are likely to exist in the list.
I'm attempting to create a dictionary with the string as the key and a set of all numbers occurring with that string as the value:
dict = {string_1: {num_a, num_m}, string_2: {num_b}, ...}
I've done this somewhat successfully with the following dictionary comprehension with nested set comprehension:
#st_id_list = [(string_1, num_a), ...]
#st_dict = {string_1: {num_a, num_m}, ...}
st_dict = {
st[0]: set(
st_[1]
for st_ in st_id_list
if st_[0] == st[0]
)
for st in st_id_list
}
There's only one issue: st_id_list
is 18,000 items long. This snippet of code takes less than ten seconds to run for a list of 500 tuples, but over twelve minutes to run for the full 18,000 tuples. I have to think this is because I've nested a set comprehension inside a dict comprehension.
Is there a way to avoid this, or a smarter way to it?
You have a double loop, so you take O(N**2) time to produce your dictionary. For 500 items, 250.000 steps are taken, and for your 18k items, 324 million steps need to be done.
Here is a O(N) loop instead, so 500 steps for your smaller dataset, 18.000 steps for the larger dataset:
st_dict = {}
for st, id in st_id_list:
st_dict.setdefault(st, set()).add(id)
This uses the dict.setdefault()
method to ensure that for a given key (your string values), there is at least an empty set available if the key is missing, then adds the current id
value to that set.
You can do the same with a collections.defaultdict()
object:
from collections import defaultdict
st_dict = defaultdict(set)
for st, id in st_id_list:
st_dict[st].add(id)
The defaultdict()
uses the factory passed in to set a default value for missing keys.
The disadvantage of the defaultdict
approach is that the object continues to produce default values for missing keys after your loop, which can hide application bugs. Use st_dict.default_factory = None
to disable the factory explicitly to prevent that.