algorithmset

Efficient algorithm for finding all maximal subsets


I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}]
output = [{1, 2, 3}, {2, 4}]

I have not been able to find a standard algorithm for this, or even a name for this problem, so I am calling it "maximal subsets" for lack of anything else. Here is an O(n^2) algorithm (in Python for concreteness), assuming is_subset_func is O(1):1

def eliminate_subsets(a, cardinality_func, is_subset_func):
    out = []
    for element in sorted(a, reverse=True, key=cardinality_func):
        for existing in out:
            if is_subset_func(element, existing):
                break
        else:
            out.append(element)
    return out

Is there a more efficient algorithm, hopefully O(n log n) or better?


1 For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.


Solution

  • Suppose you label all the input sets.

    A={1, 2, 3}, B={1, 2}, C={2, 3}, D={2, 4}, E={}
    

    Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

    1={A,B}
    2={A,B,C,D}
    3={A,C}
    4={D}
    

    Now for each input set compute the intersection of all the label sets of its elements:

    For A, {A,B} intesect {A,B,C,D} intersect {A,C} = {A}   (*)
    

    If the intersection contains some label other than for the set itself, then it's s a subset of that set. Here there is no other element, so the answer is no. But,

    For C, {A,B,C,D} intersect {A,C} = {A,C}, which means that it's a subset of A.
    

    The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m and |U| items in the universe. Then the intermediate set construction produces |U| sets of size n bits, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all.

    Putting all these together we have O(|U|n) + O(nm) +O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.

    Addition: On Line Version

    The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.

    The key is to note that A is a superset of B iff A' is a subset of B' (the tick' denoting set complement).

    Following this inspiration, we maintain the intermediate set as before. When a new input set S arrives, do the same test as described above: Let I(e) be the intermediate set for input element e. Then this test is

    For X = \intersect_{e \in S} . I(e), |X| > 0
    

    (In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded.

    Otherwise we must add S as a new maximal set, but before doing this, compute:

    Y = \intersect_{e \in S'} . I'(e) = ( \union_{e \in S'} . I(e) )'
    

    where again the tick' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superceded by S. They must be removed from the maximal collection and from I. Finally add S as a maximal set and update I with S's elements.

    Let's work through our example. When A arrives, we add it to I and have

    1={A}  2={A}  3={A}
    

    When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But it must be added as a new maximal set, and I becomes

    1={A}  2={A,D}  3={A}  4={D}
    

    The arrival of E causes no change. Posit the arrival then of a new set F={2, 3, 4, 5}. We find

    X = {A} isect {A,D} isect {A} isect {D} isect {}
    

    so we cannot throw F away. Continue with

    Y = \intersect_{e in {1}} I'(e) = I'(1) = {D}
    

    This tells us D is a subset of F, so should be discarded while F is added, leaving

    1={A} 2={A,F} 3={A,F} 4={F} 5={F}
    

    The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.

    I hope this is helpful.

    Summary

    The general principle at work here is a powerful idea that crops of often in algorithm design. It's the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to search out the item that ends up in a given location a, you must search p for a, a linear time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

    Implementation by Original Poster

    import collections
    import operator
    from functools import reduce # only in Python 3
    
    def is_power_of_two(n):
        """Returns True iff n is a power of two.  Assumes n > 0."""
        return (n & (n - 1)) == 0
    
    def eliminate_subsets(sequence_of_sets):
        """Return a list of the elements of `sequence_of_sets`, removing all
        elements that are subsets of other elements.  Assumes that each
        element is a set or frozenset and that no element is repeated."""
        # The code below does not handle the case of a sequence containing
        # only the empty set, so let's just handle all easy cases now.
        if len(sequence_of_sets) <= 1:
            return list(sequence_of_sets)
        # We need an indexable sequence so that we can use a bitmap to
        # represent each set.
        if not isinstance(sequence_of_sets, collections.Sequence):
            sequence_of_sets = list(sequence_of_sets)
        # For each element, construct the list of all sets containing that
        # element.
        sets_containing_element = {}
        for i, s in enumerate(sequence_of_sets):
            for element in s:
                try:
                    sets_containing_element[element] |= 1 << i
                except KeyError:
                    sets_containing_element[element] = 1 << i
        # For each set, if the intersection of all of the lists in which it is
        # contained has length != 1, this set can be eliminated.
        out = [s for s in sequence_of_sets
               if s and is_power_of_two(reduce(
                   operator.and_, (sets_containing_element[x] for x in s)))]
        return out