pythonsetcpythonmembership

Python set memebership test


In an algorithmics course our teacher covered "virtual initialization" where you allocate memory for an operation, but don't initialize all the values since the problem space might be too large compared to the values that actually need to be calculated (for example a dictionary or set), which wastes a lot of time for setting a default value. The basic principle is to have two arrays that point to each other (index each other) and keep track of how many variables have been assigned. let's say we have an array a and we want to find if a[i] contains a valid value, we can use array b as an index to a like so: enter image description here

I had a look at the python time complexity table at https://wiki.python.org/moin/TimeComplexity and it mentions that a set membership test in the worst case can be of O(n). I'm not sure where to find the exact implementation of the set function but after a bit of googling, most people mention that it uses a hash table. My main questions are:

  1. How can a hash table be used to check if a value is valid or not? We can hash every value and read the result, but that doesn't mean the output isn't garbage (we can be reading whatever was written to that address when malloc was called).

  2. virtual initialization completely avoids the problem with collisions in a hash table, so why is it not a better solution than using a hash table?


Solution

  • When you implement a hash table using an array, you need a flag in each entry to indicate whether it's currently in use. This is needed to deal with hash function collisions -- if two values have the same hash code, you can't put them both in the same element of the array.

    This means that when you allocate the array, you have to go through it, initializing all these flags. And you have to do this again whenever you grow the hash table.

    "Virtual initialization" avoids this. The algorithm you pasted is used to tell whether an a[k] is in use. It uses a second array b that contains indexes into a. When inserting an element, a[k].p contains an index j in b, and b[j] contains k. If these two match, and also j is lower than the limit of all indexes, you know that the array element is in use.

    To clear an entry, you simply set a[k].p = 0, since all valid entries have p between 1 and n.

    This is an example of a time-space tradeoff in algorithm design. To avoid the time spent initializing the array, we allocate a second array. If you have lots of available memory, this can be an acceptable tradeoff.