performance numpy structured-array

Numpy structured array performance


I've got a look-up problem that boils down to the following situation: three columns of positive integers. For some value i, which values in 'column_3' belong to rows where 'column_1' is below i and 'column_2' is above i?

import numpy as np

rows = 1_000_000  # must be an int: shapes and sizes cannot be floats
i = 500_000_000

ts = np.zeros((rows,), dtype=[('column_1','int64'),('column_2','int64'),('column_3','int64')])
ts['column_1'] = np.random.randint(low=0,high=10**9,size=rows)
ts['column_2'] = np.random.randint(low=0,high=10**9,size=rows)
ts['column_3'] = np.random.randint(low=0,high=10**9,size=rows)

This is the operation I'd like to optimize:

%%timeit
a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']

Is there anything I'm overlooking that could make this faster? I'd be grateful for any advice!


Solution

  • Keeping the 3 random arrays as separate plain arrays A, B, C when you create them (in addition to copying them into ts) and comparing:

    In [3]: %%timeit
       ...: a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
       ...: 
    22.5 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    In [4]: %%timeit
       ...: a = C[(A < i)&(B > i)]
       ...: 
       ...: 
    9.36 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Using a,b,c = ts['column_1'],ts['column_2'],ts['column_3'] and then c[(a < i)&(b > i)] instead falls in between.

    Those are variants and timings you can play with. As far as I can see, these are just minor differences caused by how the arrays are indexed. Nothing like an order of magnitude difference.
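For anyone who wants to reproduce the comparison, here is a minimal sketch of the three variants side by side. The answer does not show how A, B, C were created, so the setup below is an assumption: the same random data is kept as plain arrays and also copied into ts. A likely (but unmeasured here) reason for the gap is that ts[mask] copies whole 24-byte records before the field is picked out, and that the fields of ts are strided views rather than contiguous arrays.

```python
import numpy as np

rows = 1_000_000
i = 500_000_000

# Assumed setup: keep the random data as three plain, contiguous arrays...
A = np.random.randint(low=0, high=10**9, size=rows, dtype=np.int64)
B = np.random.randint(low=0, high=10**9, size=rows, dtype=np.int64)
C = np.random.randint(low=0, high=10**9, size=rows, dtype=np.int64)

# ...and also copy it into the structured array from the question.
ts = np.zeros((rows,), dtype=[('column_1', 'int64'),
                              ('column_2', 'int64'),
                              ('column_3', 'int64')])
ts['column_1'] = A
ts['column_2'] = B
ts['column_3'] = C

# Variant 1 (question): mask whole records, then pick the field.
res_struct = ts[(ts['column_1'] < i) & (ts['column_2'] > i)]['column_3']

# Variant 2 (answer): plain arrays only.
res_plain = C[(A < i) & (B > i)]

# Variant 3 (answer): field views taken once, then mask a single column.
a, b, c = ts['column_1'], ts['column_2'], ts['column_3']
res_views = c[(a < i) & (b > i)]

# The field views step over the other two fields of each record.
print(A.strides, a.strides)
```

All three variants select the same values, so each can be wrapped in %%timeit to compare them on your own machine; the timings will vary with hardware and NumPy version.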