python, arrays, numpy, deduplication

Generating numpy array of indices for a deduplicated set of points


I have an array of points, anywhere from tens of thousands up to 3 billion, some of which are duplicated. I'd like to deduplicate the points and generate an index array that maps each original point back to its position in the deduplicated set, so the original sequence is retained.

For example:

x = [(0, 0),  # (x1, y1)
     (1, 0),  # (x2, y2)
     (1, 1),  # (x3, y3)
     (0, 0)]  # (x4, y4)

Deduplicating x, we have y:

y = list(set(x)) = [(1, 0),  # (x2, y2)
                    (0, 0),  # (x1, y1) and (x4, y4)
                    (1, 1)]  # (x3, y3)

And then we would have a resulting index array, z:

z = [1,  # (x1, y1) 
     0,  # (x2, y2)
     2,  # (x3, y3)
     1]  # (x4, y4)
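
In other words, z should satisfy y[z[i]] == x[i] for every i. A quick sanity check on the toy data above (plain Python, illustration only):

# every original point is recoverable from the deduplicated set via z
assert all(y[z[i]] == x[i] for i in range(len(x)))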

Is there a numpy-like way of obtaining z? Here's a brute-force implementation:

z = []
for each_point in x:
    index = y.index(each_point)  # linear scan of y for every point
    z.append(index)

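For reference, a single-pass pure-Python sketch can build y and z together with a dict (each point then costs an O(1) lookup instead of a scan of y), though I'd still expect a vectorized numpy approach to do better at this scale:

seen = {}        # point -> its index in y
y, z = [], []
for each_point in x:
    if each_point not in seen:
        seen[each_point] = len(y)
        y.append(each_point)
    z.append(seen[each_point])
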
Solution

    import numpy as np

    x = np.asarray(x)  # needs a 2-D array, shape (n_points, 2)
    # View each row as one opaque void item so np.unique compares whole rows
    x2 = np.ascontiguousarray(x).view(np.dtype((np.void, x.dtype.itemsize * x.shape[1])))
    y_temp, z = np.unique(x2, return_inverse=True)
    y = y_temp.view(x.dtype).reshape(len(y_temp), 2)  # back to (n_unique, 2)
    print(y)
    print(z)
    

    yields

    [[0 0]
     [1 0]
     [1 1]]
    

    and

    [0 1 2 0]
    

    Credit: Find unique rows in numpy.array
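
    On NumPy 1.13 or newer, the void-view step can be skipped entirely: np.unique accepts an axis argument and can return the inverse indices directly (a sketch, assuming a 2-D integer array):

    import numpy as np

    x = np.array([(0, 0), (1, 0), (1, 1), (0, 0)])
    # unique rows (in sorted order) plus each original row's index into them
    y, z = np.unique(x, axis=0, return_inverse=True)
    print(y)  # [[0 0], [1 0], [1 1]]
    print(z)  # [0 1 2 0]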