I'm looking for a vectorized function that returns a mask with values of True if the value in the array has been seen before and False otherwise.
I'm looking for the fastest solution possible as speed is very important.
For example, this is what I would like to see:
array = [1, 2, 1, 2, 3]
mask = [False, False, True, True, False]
So is_duplicate = array[mask] should return [1, 2].
Is there a fast, vectorized way to do this? Thanks!
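To pin down the requested behavior, here is a minimal plain-Python sketch. It is not vectorized, so it is not the fast solution being asked for; it only serves as a baseline for checking faster versions. The name seen_before_mask is hypothetical, chosen for illustration, and the input is assumed to be a 1-D NumPy array of hashable values.

```python
import numpy as np

def seen_before_mask(values):
    """Return a boolean mask that is True where the value appeared earlier."""
    seen = set()
    mask = np.zeros(len(values), dtype=bool)
    for i, v in enumerate(values):
        if v in seen:
            mask[i] = True   # repeat of an earlier value
        else:
            seen.add(v)      # first time this value shows up
    return mask

array = np.array([1, 2, 1, 2, 3])
mask = seen_before_mask(array)
is_duplicate = array[mask]   # boolean indexing needs an ndarray, not a list
```

Note that array must be a NumPy array for array[mask] to work; a plain Python list does not support boolean-mask indexing.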
Approach #1 : With sorting
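import numpy as np

def mask_firstocc(a):
    # Stable sort keeps equal values in their original relative order
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    # Flag repeats in sorted order, then map back via the inverse permutation
    out = np.r_[False, b[:-1] == b[1:]][sidx.argsort()]
    return out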
We can use array assignment in place of the second argsort to boost performance further -
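def mask_firstocc_v2(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    # True where a sorted value equals its predecessor, i.e. a repeat
    mask = np.r_[False, b[:-1] == b[1:]]
    out = np.empty(len(a), dtype=bool)
    # Scatter the flags back to the original positions
    out[sidx] = mask
    return out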
Sample run -
In [166]: a
Out[166]: array([2, 1, 1, 0, 0, 4, 0, 3])
In [167]: mask_firstocc(a)
Out[167]: array([False, False, True, False, True, False, True, False])
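To see why the stable sort matters, the intermediate arrays for the sample input can be traced step by step. This is only an illustrative walk-through of the same steps as above; the variable name dup_sorted is made up for this sketch.

```python
import numpy as np

a = np.array([2, 1, 1, 0, 0, 4, 0, 3])

# Stable sort: equal values keep their original left-to-right order,
# so the first occurrence of each value comes first within its group.
sidx = a.argsort(kind='stable')      # positions [3, 4, 6, 1, 2, 0, 7, 5]
b = a[sidx]                          # sorted values [0, 0, 0, 1, 1, 2, 3, 4]

# A sorted element equal to its predecessor is a later occurrence.
dup_sorted = np.r_[False, b[:-1] == b[1:]]

# Scatter the flags back to the original positions.
out = np.empty(len(a), dtype=bool)
out[sidx] = dup_sorted
```

With an unstable sort, the order inside each group of equal values is not guaranteed, so the flag could land on the first occurrence instead of a later one.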
Approach #2 : With np.unique(..., return_index)
We can leverage np.unique with return_index=True, which returns the index of the first occurrence of each unique element. Hence we can start from an all-True mask and set those positions to False with a simple array assignment -
def mask_firstocc_with_unique(a):
    mask = np.ones(len(a), dtype=bool)
    mask[np.unique(a, return_index=True)[1]] = False
    return mask
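Before picking one variant for speed, it may be worth confirming that the sorting and np.unique approaches agree on your own data. The following is a rough sketch under some assumptions: 1-D integer input, a seeded random test array, and a hypothetical helper mask_loop used as a slow reference. The two vectorized functions are repeated here only so the snippet runs on its own.

```python
import numpy as np

def mask_firstocc_v2(a):
    sidx = a.argsort(kind='stable')
    b = a[sidx]
    out = np.empty(len(a), dtype=bool)
    out[sidx] = np.r_[False, b[:-1] == b[1:]]
    return out

def mask_firstocc_with_unique(a):
    mask = np.ones(len(a), dtype=bool)
    mask[np.unique(a, return_index=True)[1]] = False
    return mask

def mask_loop(a):
    # Slow reference (hypothetical helper): True once a value has been seen.
    seen, out = set(), []
    for v in a.tolist():
        out.append(v in seen)
        seen.add(v)
    return np.array(out, dtype=bool)

rng = np.random.default_rng(0)
a = rng.integers(0, 50, size=1000)

expected = mask_loop(a)
all_agree = (np.array_equal(mask_firstocc_v2(a), expected)
             and np.array_equal(mask_firstocc_with_unique(a), expected))
```

Which variant is faster likely depends on the array size and the number of distinct values, so timing both on representative input (for example with timeit) is a safer guide than a general rule.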