Just run the code:
import numpy as np

a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
print(set(a)) # out: {frozenset({3, 4}), frozenset({1, 2})}
print(np.unique(a)) # out: [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
The first output is correct, the second is not. The problem is exactly here:
a[0] == a[-1] # out: True
But the array returned by np.unique has 3 elements, not 2.
I've been using np.unique to handle duplicates (e.g. with return_index=True and the other options). What can you advise me to use instead of np.unique for these purposes?
numpy.unique operates by sorting, then collapsing runs of identical elements. Per the docstring:

    Returns the sorted unique elements of an array.

The "sorted" part implies it's using a sort-then-collapse-adjacent technique (similar to what the *NIX sort | uniq pipeline accomplishes).
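To make the sort-then-collapse idea concrete, here is a rough sketch of the technique in plain NumPy operations (this is an illustration of the approach, not NumPy's actual implementation):

```python
import numpy as np

arr = np.array([3, 1, 2, 1, 3])
s = np.sort(arr)                                  # group equal values together
keep = np.concatenate(([True], s[1:] != s[:-1]))  # keep the first of each run
print(s[keep])                                    # [1 2 3]
```

This only works when sorting actually groups equal values adjacently, which is exactly the assumption frozenset breaks.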
The problem is that while frozenset does define __lt__ (the overload for <, which most Python sorting algorithms use as their basic building block), it doesn't use it to define a total ordering the way numbers and sequences do. It's overloaded to test "is a proper subset of" (excluding direct equality). So frozenset({1,2}) < frozenset({3,4}) is False, and so is frozenset({3,4}) > frozenset({1,2}).
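You can verify the subset semantics directly:

```python
s1 = frozenset({1, 2})
s2 = frozenset({3, 4})
print(s1 < s2)              # False: {1, 2} is not a proper subset of {3, 4}
print(s2 > s1)              # False: {3, 4} is not a proper superset of {1, 2}
print(s1 < s1)              # False: "proper" subset excludes equality
print(frozenset({1}) < s1)  # True: {1} is a proper subset of {1, 2}
```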
Because the expected sort invariant is broken, sorting sequences of set-like objects produces implementation-specific and largely useless results, and uniquifying strategies based on sorting will typically fail under those conditions. One possible outcome: the sort finds the sequence to be already in order (or in reverse order), since each element is "less than" both the prior and subsequent elements. If it decides the data is in order, nothing changes; if in reverse order, it swaps the element order (which, in this case, is indistinguishable from preserving it). The collapse step then removes adjacent duplicates (post-sort, all duplicates should be grouped together), finds none (the duplicates aren't adjacent), and returns the original data.
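You can see the effect on the original example: on CPython, sorting the list leaves it untouched, because every adjacent comparison reports "already in order". The exact result is an implementation detail of Timsort, so don't rely on it beyond this demonstration:

```python
a = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
# Every a[i+1] < a[i] check is False, so Timsort (CPython's sort) sees one
# "ascending" run and changes nothing; the duplicates never become adjacent.
print(sorted(a) == a)  # True on CPython
```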
For frozensets, you probably want to use hash-based uniquification, e.g. via set or (to preserve the original order of appearance on Python 3.7+) dict.fromkeys; the latter would be simply:
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = list(dict.fromkeys(a)) # Works on CPython/PyPy 3.6 as an implementation detail, and on 3.7+ everywhere
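Since the question mentions return_index=True: a plain dict can recover first-occurrence indices too. This is just one way to sketch it (the name first_index is mine, not a standard API):

```python
a = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]

first_index = {}  # maps each value to the index of its first occurrence
for i, v in enumerate(a):
    first_index.setdefault(v, i)  # only records i the first time v is seen

uniqa = list(first_index)         # unique values, in first-seen order
idx = list(first_index.values())  # analogous to np.unique's return_index=True
print(uniqa)  # [frozenset({1, 2}), frozenset({3, 4})]
print(idx)    # [0, 1]
```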
It's also possible to use sort-based uniquification, but numpy.unique doesn't support a key function, so it's easier to stick to Python's built-in tools:
from itertools import groupby # With no key argument, works much like the uniq command line tool
a = [frozenset({1,2}),frozenset({3,4}),frozenset({1,2})]
uniqa = [k for k, _ in groupby(sorted(a, key=sorted))]
That second line is a little dense, so I'll break it up:

- sorted(a, key=sorted) returns a new list based on a, ordered by the sorted list form of each element (so the < comparison actually does put like with like).
- groupby(...) returns an iterator of key/group-iterator pairs. With no key argument to groupby, each key is a unique value, and the group-iterator produces that value as many times as it was seen.
- [k for k, _ in ...] ignores the group-iterator, since we don't care how many times each duplicate value was seen (assigning to _ means "ignored" by convention), and has the list comprehension produce only the keys (the unique values).
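And if you also need the equivalent of return_counts=True, collections.Counter does hash-based uniquification and counting in a single pass (insertion-ordered on 3.7+, same as dict):

```python
from collections import Counter

a = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 2})]
counts = Counter(a)
print(list(counts))           # [frozenset({1, 2}), frozenset({3, 4})]
print(list(counts.values()))  # [2, 1] -- analogous to return_counts=True
```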