pythonnumpysetfrozenset

numpy.unique has the problem with frozensets


Just run the code:

a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
print(set(a))  # out: {frozenset({3, 4}), frozenset({1, 2})}
print(np.unique(a)) # out: [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]

The first out is correct, the second is not. The problem exactly is here:

a[0]==a[-1] # out: True

But set from np.unique has 3 elements, not 2.

I used to utilize np.unique to work with duplicates for ex (using return_index=True and others). What can u advise for me to use instead np.unique for these purposes?


Solution

  • numpy.unique operates by sorting, then collapsing runs of identical elements. Per the doc string:

    Returns the sorted unique elements of an array.

    The "sorted" part implies it's using a sort-collapse-adjacent technique (similar to what the *NIX sort | uniq pipeline accomplishes).

    The problem is that while frozenset does define __lt__ (the overload for <, which most Python sorting algorithms use as their basic building block), it's not using it for the purposes of a total ordering like numbers and sequences use it. It's overloaded to test "is a proper subset of" (not including direct equality). So frozenset({1,2}) < frozenset({3,4}) is False, and so is frozenset({3,4}) > frozenset({1,2}).

    Because the expected sort invariant is broken, sorting sequences of set-like objects produces implementation-specific and largely useless results. Uniquifying strategies based on sorting will typically fail under those conditions; one possible result is that it will find the sequence to be sorted in order or reverse order already (since each element is "less than" both the prior and subsequent elements); if it determines it to be in order, nothing changes, if it's in reverse order, it swaps the element order (but in this case that's indistinguishable from preserving order). Then it removes adjacent duplicates (since post-sort, all duplicates should be grouped together), finds none (the duplicates aren't adjacent), and returns the original data.

    For frozensets, you probably want to use hash based uniquification, e.g. via set or (to preserve original order of appearance on Python 3.7+), dict.fromkeys; the latter would be simply:

    a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
    uniqa = list(dict.fromkeys(a))  # Works on CPython/PyPy 3.6 as implementation detail, and on 3.7+ everywhere
    

    It's also possible to use sort-based uniquification, but numpy.unique doesn't seem to support a key function, so it's easier to stick to Python built-in tools:

    from itertools import groupby  # With no key argument, can be used much like uniq command line tool
    
    a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
    uniqa = [k for k, _ in groupby(sorted(a, key=sorted))]
    

    That second line is a little dense, so I'll break it up:

    1. sorted(a, key=sorted) - Returns a new list based on a where each element is sorted based on the sorted list form of the element (so the < comparison actually does put like with like)
    2. groupby(...) returns an iterator of key/group-iterator pairs. With no key argument to groupby, it just means each key is a unique value, and the group-iterator produces that value as many times as it was seen.
    3. [k for k, _ in ...] Since we don't care how many times each duplicate value was seen, so we ignore the group-iterator (assigning to _ means "ignored" by convention), and have the list comprehension produce only the keys (the unique values)