pythondictionaryhashset

Is there `dict.setdefault` equivalent for sets?


A common pattern when working with a set is the following:

number_list = [1,5,7,2,4,4,1,3,8,5]
number_set = set()

for number in number_list:

   #we only want to process the number if we haven't already processed it
   if(number not in number_set):
       number_set.add(number)

       #do processing of 'number' here now that we know it's not a duplicate

The lines if(number not in number_set): and number_set.add(number) bug me because we're doing two hash lookups here, when realistically we should only need one.

Dictionaries have the "setdefault" operation, which solves a very similar problem: "If the key exists in the dictionary, return the value, otherwise insert this default and then return the default". If you do this naively, IE the following, you perform two hash lookups, but setdefault allows you to do it in one

if item_key in dict:
   dict[item_key].append(item_value)
else:
   dict[item_key] = [item_value]

Is there an equivalent operation for sets? Something like if(number_set.check_if_contains_and_then_add(number)): but given a much nicer name.


Solution

  • If the profiler tells you that hash lookups contribute significant runtime, then this might work around it.

    def add_value(container, value):
        oldlen = len(container)
        container.add(value)
        return len(container) != oldlen
    
    if add_value(number_set, number):
        # process number
    

    But why would that be? Perhaps due to a slow __hash__ method, although I can tell you now that (a) hashing integers isn't slow and (b) if you possibly can, it's better to make the class with the slow __hash__ cache the result instead of reducing the number of calls. Or perhaps due to a slow __eq__, which is harder to deal with. Finally if the internal lookup mechanism itself is slow, then there may not be a great deal you can do to speed your program up, because the runtime is doing hash lookups all the time, finding names in scopes.

    It would probably be nice for set.add to return a value indicating whether or not the set changed, but I think that idea runs up against a principle of the Python libraries (admittedly not universally upheld) that mutating operations don't return a value unless it's fundamental to the operation to do so. So pop() functions return a value of course, but list.sort() returns None even though it would occasionally be useful to users if it returned self.

    I suppose you could do something like this:

    def deduped(iterable):
        seen = set()
        count = 0
        for value in iterable:
            seen.add(value)
            if count != len(seen):
                count += 1
                yield value
    
    for number in deduped(number_list):
        # process number
    

    Of course it's pure speculation that the repeated hash lookup is any kind of problem: I would normally write either of those functions with the if not in test as in your original code, and the purpose of the function would be to simplify the calling code, not to avoid superfluous hash lookups.