Tags: arrays, numpy, addition, indices, numpy-ufunc

numpy.add.at slower than in-place add?


Following up on one of my earlier posts, I noticed that np.add.at(A, indices, B) is a lot slower than A[indices] += B.

import numpy as np

def fast(A):
    # In-place += on overlapping slices, one row at a time
    n = A.shape[0]
    retval = np.zeros(2*n - 1)
    for i in range(n):
        retval[i:i+n] += A[i, :]
    return retval

def slow(A):
    # Same loop, but using np.add.at with a slice
    n = A.shape[0]
    retval = np.zeros(2*n - 1)
    for i in range(n):
        np.add.at(retval, slice(i, i+n), A[i, :])
    return retval

def slower(A):
    # Build the full (n, n) index array and call np.add.at once
    n = A.shape[0]
    retval = np.zeros(2*n - 1)
    indices = np.arange(n)
    indices = indices[:, None] + indices
    np.add.at(retval, indices, A)  # bottleneck here
    return retval

My timings:

from timeit import timeit

A = np.random.randn(10000, 10000)

timeit(lambda: fast(A), number=10)    # 0.8852798199995959
timeit(lambda: slow(A), number=10)    # 56.633683917999406
timeit(lambda: slower(A), number=10)  # 57.763389584000834

Clearly, using __iadd__ (the in-place +=) is a lot faster. However, the documentation for np.add.at states:

Performs unbuffered in place operation on operand ‘a’ for elements specified by ‘indices’. For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.


Why is np.add.at so slow?

What is the use-case for this function?


Solution

  • add.at is intended for cases where indices contains duplicates, so that += does not produce the desired result:

    In [44]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    In [45]: A[idx]+=1
    In [46]: A
    Out[46]: array([1, 1, 1, 1, 0])    # the duplicates in idx are ignored
    

    With add.at:

    In [47]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    In [48]: np.add.at(A, idx, 1)
    In [49]: A
    Out[49]: array([1, 2, 3, 4, 0])
    

    Same result as with an explicit iteration:

    In [50]: A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
    In [51]: for i in idx: A[i]+=1
    In [52]: A
    Out[52]: array([1, 2, 3, 4, 0])
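    The reason += misses the duplicates: with advanced indexing, A[idx] += 1 is executed as a gather, an add, and a scatter, so writes to duplicate positions simply overwrite each other. A sketch of what happens under the hood (my illustration, not from the original answer):

```python
import numpy as np

A = np.zeros(5, int)
idx = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3])

# A[idx] += 1 behaves like these three steps:
tmp = A[idx]   # gather the current values (all zeros here)
tmp = tmp + 1  # add 1 to the gathered copy
A[idx] = tmp   # scatter back; duplicate indices overwrite each other
print(A)       # [1 1 1 1 0] -- the repeated increments are lost
```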
    

    Some timings:

    In [53]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
        ...: A[idx]+=1
    3.65 µs ± 13.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    In [54]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
        ...: np.add.at(A, idx, 1)
    6.47 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    In [55]: %%timeit A = np.zeros(5,int); idx = np.array([0,1,1,2,2,2,3,3,3,3])
        ...: np.add.at(A, idx, 1)
        ...: for i in idx: A[i]+=1
    15.6 µs ± 41.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    add.at is slower than += but faster than the Python iteration (note the last cell times add.at plus the loop, so the loop alone accounts for roughly 9 µs).

    We could test further variants, such as A[:4] + 1 versus A[:4] += 1, etc.

    Anyway, don't use add.at if you don't need it.
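    For the pure counting case there is another option worth knowing: np.bincount, which accumulates duplicates correctly and is typically faster than add.at (a sketch I'm adding, not part of the original answer):

```python
import numpy as np

idx = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3])

# bincount counts occurrences of each index value;
# minlength pads the result out to the target array's length
counts = np.bincount(idx, minlength=5)
print(counts)  # [1 2 3 4 0] -- same as np.add.at(np.zeros(5, int), idx, 1)
```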

    edit

    Your example, simplified a bit:

    In [108]: x = np.zeros(2*10-1)
         ...: for i in range(10):
         ...:     x[i:i+10] += 1
         ...: 
    In [109]: x
    Out[109]: 
    array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
            6.,  5.,  4.,  3.,  2.,  1.])
    

    So you are adding values to overlapping slices. We could replace the slices with an array:

    In [110]: x = np.zeros(2*10-1)
         ...: for i in range(10):
         ...:     x[np.arange(i,i+10)] += 1
         ...: 
    In [111]: x
    Out[111]: 
    array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
            6.,  5.,  4.,  3.,  2.,  1.])
    

    There's no need to time your 'slow' case (add.at with slices), since slices never contain duplicate indices.
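    The reason slices are safe: basic slicing returns a view, so += on a slice is a true in-place operation with no gather/scatter step. A quick check (my illustration):

```python
import numpy as np

x = np.zeros(5)
v = x[1:4]          # basic slicing returns a view into x
assert v.base is x  # v shares memory with x
v += 1              # in-place add through the view
print(x)            # [0. 1. 1. 1. 0.] -- x itself was modified
```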

    Now create all the indices at once. A plain += does not work here because of buffering: each target element is incremented only once, even when its index appears several times:

    In [112]: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
    In [113]: y=np.zeros(2*10-1)
         ...: y[idx]+=1
    In [114]: y
    Out[114]: 
    array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
           1., 1.])
    

    add.at solves that:

    In [115]: y=np.zeros(2*10-1)
         ...: np.add.at(y, idx, 1)
    In [116]: y
    Out[116]: 
    array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
            6.,  5.,  4.,  3.,  2.,  1.])
    

    And the full python iteration:

    In [117]: y=np.zeros(2*10-1)
         ...: for i in idx: y[i]+=1
    In [118]: y
    Out[118]: 
    array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.,  9.,  8.,  7.,
            6.,  5.,  4.,  3.,  2.,  1.])
    

    Now some timings.

    The base line:

    In [119]: %%timeit
         ...: x = np.zeros(2*10-1)
         ...: for i in range(10):
         ...:     x[i:i+10] += 1
         ...: 
    50.5 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Advanced indexing slows this down some:

    In [120]: %%timeit
         ...: x = np.zeros(2*10-1)
         ...: for i in range(10):
         ...:     x[np.arange(i,i+10)] += 1
         ...: 
    75.2 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    If it produced the correct result, a single advanced-indexing += would be fastest:

    In [121]: %%timeit
         ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
         ...: y=np.zeros(2*10-1)
         ...: y[idx]+=1
         ...: 
         ...: 
    17.5 µs ± 693 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Full python iteration is about the same as the looped arange case:

    In [122]: %%timeit
         ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
         ...: y=np.zeros(2*10-1)
         ...: for i in idx: y[i]+=1
         ...: 
         ...: 
    76.3 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    add.at is slower than the flat +=, but still better than the base line:

    In [123]: %%timeit
         ...: idx=np.arange(10); idx=(idx[:,None]+idx).ravel()
         ...: y=np.zeros(2*10-1)
         ...: np.add.at(y, idx,1)
         ...: 
         ...: 
    29.4 µs ± 21.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    In my smaller test, your slower does best. But it may not scale as well as base/fast: your indices array is much larger. For very large arrays, iterating over one dimension is often faster because it reduces the memory-management load.
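    For completeness, one more approach for this particular accumulation (my addition, not in the original answer): np.bincount with weights handles duplicate indices like add.at and often runs considerably faster, since it has a dedicated C implementation:

```python
import numpy as np

def bincount_version(A):
    # Alternative sketch of slower(): accumulate A into bins given by
    # the flattened overlap indices; duplicate indices are summed.
    n = A.shape[0]
    idx = np.arange(n)
    idx = (idx[:, None] + idx).ravel()
    return np.bincount(idx, weights=A.ravel(), minlength=2 * n - 1)

# sanity check against the explicit slice loop from fast()
A = np.random.randn(100, 100)
n = A.shape[0]
ref = np.zeros(2 * n - 1)
for i in range(n):
    ref[i:i + n] += A[i, :]
assert np.allclose(bincount_version(A), ref)
```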