python sorting numpy lexicographic lexicographic-ordering

Assign same lexicographic rank to duplicate elements of 2d array

I'm trying to lexicographically rank array components. The below code works fine, but I'd like to assign equal ranks to equal elements.

import numpy as np

values = np.asarray([
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 2]
])
# need to flip, because for `np.lexsort` last
# element has highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the order in
# which the elements should be in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)

The rank variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required because elements [1, 2, 3] (index 0 and 3) share the same rank. Continuous rank numbers are ok, so [2, 0, 3, 2, 1] is also an acceptable rank array.

Solution

Here's one approach -

# Get lexsorted indices and hence sorted values by those indices
lexsort_idx = np.lexsort(values.T[::-1])
lexsort_vals = values[lexsort_idx]

# Mask of steps where rows shift (there are no duplicates in subsequent rows) 
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]

# Get the stepped indices (indices shift at non duplicate rows) and
# the index values are scaled corresponding to row numbers     
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))    

# Re-arrange the stepped indices based on the original order of rows
# This is basically same as the original code does in last 4 steps,
# just in a concise manner
out_idx = stepped_idx[lexsort_idx.argsort()]

Sample step-by-step intermediate outputs -

In [55]: values
Out[55]: 
array([[1, 2, 3],
       [1, 1, 1],
       [2, 2, 3],
       [1, 2, 3],
       [1, 1, 2]])

In [56]: lexsort_idx
Out[56]: array([1, 4, 0, 3, 2])

In [57]: lexsort_vals
Out[57]: 
array([[1, 1, 1],
       [1, 1, 2],
       [1, 2, 3],
       [1, 2, 3],
       [2, 2, 3]])

In [58]: mask
Out[58]: array([ True,  True,  True, False,  True], dtype=bool)

In [59]: stepped_idx
Out[59]: array([0, 1, 2, 2, 4])

In [60]: lexsort_idx.argsort()
Out[60]: array([2, 0, 4, 3, 1])

In [61]: stepped_idx[lexsort_idx.argsort()]
Out[61]: array([2, 0, 4, 2, 1])

Performance boost

For more performance efficiency to compute lexsort_idx.argsort(), we could use and this is identical to the original code in last 4 lines -

def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

Thus, lexsort_idx.argsort() could be alternatively computed with argsort_unique(lexsort_idx).

Runtime test

Applying few more optimization tricks, we would have a version like so -

def numpy_app(values):
    lexsort_idx = np.lexsort(values.T[::-1])
    lexsort_v = values[lexsort_idx]
    mask = np.concatenate(( [False],(lexsort_v[1:] == lexsort_v[:-1]).all(1) ))

    stepped_idx = np.arange(mask.size)
    stepped_idx[mask] = 0
    np.maximum.accumulate(stepped_idx, out=stepped_idx)

    return stepped_idx[argsort_unique(lexsort_idx)]

@Warren Weckesser's rankdata based method as a func for timings -

def scipy_app(values):
    v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
    return rankdata(v, method='min') - 1

Timings -

In [97]: a = np.random.randint(0,9,(10000,3))

In [98]: out1 = numpy_app(a)

In [99]: out2 = scipy_app(a)

In [100]: np.allclose(out1, out2)
Out[100]: True

In [101]: %timeit scipy_app(a)
100 loops, best of 3: 5.32 ms per loop

In [102]: %timeit numpy_app(a)
100 loops, best of 3: 1.96 ms per loop