pythonsortingnumpylexicographiclexicographic-ordering

Assign same lexicographic rank to duplicate elements of 2d array


I'm trying to lexicographically rank array components. The below code works fine, but I'd like to assign equal ranks to equal elements.

import numpy as np

values = np.asarray([
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 3],
    [1, 2, 3],
    [1, 1, 2]
])
# need to flip, because for `np.lexsort` last
# element has highest priority.
values_reversed = np.fliplr(values)
# this returns the order, i.e. the order in
# which the elements should be in a sorted
# array (not the rank by index).
order = np.lexsort(values_reversed.T)
# convert order to ranks.
n = values.shape[0]
ranks = np.empty(n, dtype=int)
# use order to assign ranks.
ranks[order] = np.arange(n)

The rank variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required because elements [1, 2, 3] (index 0 and 3) share the same rank. Continuous rank numbers are ok, so [2, 0, 3, 2, 1] is also an acceptable rank array.


Solution

  • Here's one approach -

    # Get lexsorted indices and hence sorted values by those indices
    lexsort_idx = np.lexsort(values.T[::-1])
    lexsort_vals = values[lexsort_idx]
    
    # Mask of steps where rows shift (there are no duplicates in subsequent rows) 
    mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)]
    
    # Get the stepped indices (indices shift at non duplicate rows) and
    # the index values are scaled corresponding to row numbers     
    stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))    
    
    # Re-arrange the stepped indices based on the original order of rows
    # This is basically same as the original code does in last 4 steps,
    # just in a concise manner
    out_idx = stepped_idx[lexsort_idx.argsort()]
    

    Sample step-by-step intermediate outputs -

    In [55]: values
    Out[55]: 
    array([[1, 2, 3],
           [1, 1, 1],
           [2, 2, 3],
           [1, 2, 3],
           [1, 1, 2]])
    
    In [56]: lexsort_idx
    Out[56]: array([1, 4, 0, 3, 2])
    
    In [57]: lexsort_vals
    Out[57]: 
    array([[1, 1, 1],
           [1, 1, 2],
           [1, 2, 3],
           [1, 2, 3],
           [2, 2, 3]])
    
    In [58]: mask
    Out[58]: array([ True,  True,  True, False,  True], dtype=bool)
    
    In [59]: stepped_idx
    Out[59]: array([0, 1, 2, 2, 4])
    
    In [60]: lexsort_idx.argsort()
    Out[60]: array([2, 0, 4, 3, 1])
    
    In [61]: stepped_idx[lexsort_idx.argsort()]
    Out[61]: array([2, 0, 4, 2, 1])
    

    Performance boost

    For more performance efficiency to compute lexsort_idx.argsort(), we could use and this is identical to the original code in last 4 lines -

    def argsort_unique(idx):
        # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
        n = idx.size
        sidx = np.empty(n,dtype=int)
        sidx[idx] = np.arange(n)
        return sidx
    

    Thus, lexsort_idx.argsort() could be alternatively computed with argsort_unique(lexsort_idx).


    Runtime test

    Applying few more optimization tricks, we would have a version like so -

    def numpy_app(values):
        lexsort_idx = np.lexsort(values.T[::-1])
        lexsort_v = values[lexsort_idx]
        mask = np.concatenate(( [False],(lexsort_v[1:] == lexsort_v[:-1]).all(1) ))
    
        stepped_idx = np.arange(mask.size)
        stepped_idx[mask] = 0
        np.maximum.accumulate(stepped_idx, out=stepped_idx)
    
        return stepped_idx[argsort_unique(lexsort_idx)]
    

    @Warren Weckesser's rankdata based method as a func for timings -

    def scipy_app(values):
        v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1])))
        return rankdata(v, method='min') - 1
    

    Timings -

    In [97]: a = np.random.randint(0,9,(10000,3))
    
    In [98]: out1 = numpy_app(a)
    
    In [99]: out2 = scipy_app(a)
    
    In [100]: np.allclose(out1, out2)
    Out[100]: True
    
    In [101]: %timeit scipy_app(a)
    100 loops, best of 3: 5.32 ms per loop
    
    In [102]: %timeit numpy_app(a)
    100 loops, best of 3: 1.96 ms per loop