pythonarraysalgorithmnumpynumpy-ndarray

How to Mark Repeated Entries as True Starting from the Second Occurrence Using NumPy?


Problem

I have a NumPy array and need to identify repeated elements, marking the second occurrence and beyond as True, while keeping the first occurrence as False.

For example, given the following array:

np.random.seed(100)
a = np.random.randint(0, 5, 10)
# Output: [0 0 3 0 2 4 2 2 2 2]

I want to get the following output:

[False True False True False False True True True True]

How can I achieve this using NumPy functions only, without using any loops or extra libraries?

What did you try and what were you expecting?

I was able to get it working with a loop, but I wanted to solve it using only NumPy functions. I tried implementing np.cumsum with masks, but I couldn’t make much progress.

Here's the solution I came up with using one loop:

np.random.seed(100)
a = np.random.randint(0, 5, 10)
print(a)
uniques, first_indices = np.unique(a, return_index=True)
all_occurrences = np.zeros_like(a, dtype=bool)
for i in range(len(a)):
    all_occurrences[i] = np.any(a[:i] == a[i])

all_occurrences[first_indices] = False
print(all_occurrences)

Solution

  • To reveal that the problem is actually less complicated than it may seem at first glance, the question could be rephrased as follows: Mark all first occurrences of values with False.

    This leads to a bit of a simplified version of EuanG's answer¹:

    def find_repeated(a):
        mask = np.ones_like(a, dtype=bool)
        mask[np.unique(a, return_index=True)[-1]] = False
        return mask
    

    Steps: (1) Initialize the result mask as an all-True array of appropriate shape. (2) Find the indices of the first occurrences in the given array a. (3) Only set these indices to False in the result mask.

    To make the code also work with n-dimensional arrays, we need to add an extra step of unraveling the result of np.unique(), as it returns the indices into the flattened given array a:

    def find_repeated(a):
        mask = np.ones_like(a, dtype=bool)
        mask[np.unravel_index(np.unique(a, return_index=True)[-1], a.shape)] = False
        return mask
    

    In either case:

    ¹) Yes, I find EuanG's answer perfectly acceptable as well. No, I did not downvote it.