pythonnumpyscipy

Numpy Array: Efficiently find matching indices


I have two lists, one of which is massive (millions of elements), the other several thousand. I want to do the following

bigArray=[0,1,0,2,3,2,,.....]

smallArray=[0,1,2,3,4]

for i in len(smallArray):
  pts=np.where(bigArray==smallArray[i])
  #Do stuff with pts...

The above works, but is slow. Is there any way to do this more efficiently without resorting to writing something in C?


Solution

  • In your case you may benefit from presorting your big array. Here is the example demonstrating how you can reduce the time from ~ 45 seconds to 2 seconds (on my laptop)(for one particular set of lengths of the arrays 5e6 vs 1e3). Obviously the solution won't be optimal if the array sizes will be wastly different. E.g. with the default solution the complexity is O(bigN*smallN), but for my suggested solution it is O((bigN+smallN)*log(bigN))

    import numpy as np, numpy.random as nprand, time, bisect
    
    bigN = 5e6
    smallN = 1000
    maxn = 1e7
    nprand.seed(1)  
    bigArr = nprand.randint(0, maxn, size=bigN)
    smallArr = nprand.randint(0, maxn, size=smallN)
    
    # brute force 
    t1 = time.time()
    for i in range(len(smallArr)):
        inds = np.where(bigArr == smallArr[i])[0]
    t2 = time.time()
    print "Brute", t2-t1
    
    # not brute force (like nested loop with index scan)
    t1 = time.time()
    sortedind = np.argsort(bigArr)
    sortedbigArr = bigArr[sortedind]
    for i in range(len(smallArr)):
        i1 = bisect.bisect_left(sortedbigArr, smallArr[i])
        i2 = bisect.bisect_right(sortedbigArr, smallArr[i])
        inds = sortedind[i1:i2]
    t2=time.time()
    print "Non-brute", t2-t1
    

    Output:

    Brute 42.5278530121

    Non-brute 1.57193303108