python numpy

How to sum values based on a second index array in a vectorized manner


Let's say I have a value array

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

and an index array

indices = np.array([0,1,0,2,2])

Is there a vectorized way to sum the values for each unique index in indices? That is, a vectorized equivalent of this snippet, which computes sums:

sums = np.zeros(np.max(indices)+1)
for index, value in zip(indices, values):
    sums[index] += value

Bonus points if the solution allows values (and consequently sums) to be multi-dimensional.

EDIT: I benchmarked the posted solutions:

import numpy as np
import time
import pandas as pd


values = np.arange(1_000_000, dtype=float)
rng = np.random.default_rng(0)
indices = rng.integers(0, 1000, size=1_000_000)


N = 100


now = time.time_ns()
for _ in range(N):
    sums = np.bincount(indices, weights=values, minlength=1000)
print(f"np.bincount: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    sums = np.zeros(1 + np.amax(indices), dtype=values.dtype)
    np.add.at(sums, indices, values)
print(f"np.add.at: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    sums = pd.Series(values).groupby(indices).sum().values
print(f"pd.groupby: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    sums = np.zeros(np.max(indices)+1)
    for index, value in zip(indices, values):
        sums[index] += value
print(f"Loop: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")

Results:

np.bincount: 1.129 ms
np.add.at: 0.763 ms
pd.groupby: 5.215 ms
Loop: 196.633 ms
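
For reference, here is a minimal sketch of the np.bincount variant from the benchmark, applied to the small example from the top of the question. It assumes the indices are non-negative integers, and as far as I know weights has to be 1-D, so this form does not cover the multi-dimensional case on its own.

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 1, 0, 2, 2])

# With weights=, bincount adds up the weights per index
# instead of counting how often each index occurs.
sums = np.bincount(indices, weights=values)
print(sums)  # [2. 1. 7.]
```

Passing minlength, as in the benchmark, only matters when the largest index may be absent from indices and a fixed output length is still required.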

Solution

  • Another possible solution, using np.add.at:

    b = np.zeros(1 + np.amax(indices), dtype=values.dtype)
    np.add.at(b, indices, values)
    

    Output:

    array([2., 1., 7.])
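
Regarding the bonus question: the same np.add.at approach should carry over to multi-dimensional values if the output array gets the trailing dimensions of values. The following is a sketch under that assumption, with a made-up 2-D values array; it assumes the first axis of values lines up with indices.

```python
import numpy as np

# Hypothetical 2-D example: one row of values per entry in indices.
values = np.array([[0.0, 10.0],
                   [1.0, 11.0],
                   [2.0, 12.0],
                   [3.0, 13.0],
                   [4.0, 14.0]])
indices = np.array([0, 1, 0, 2, 2])

# One output row per index, keeping the trailing dimensions of values.
sums = np.zeros((indices.max() + 1,) + values.shape[1:], dtype=values.dtype)

# Unbuffered in-place add: repeated indices accumulate instead of overwriting.
np.add.at(sums, indices, values)
print(sums)
# [[ 2. 22.]
#  [ 1. 11.]
#  [ 7. 27.]]
```

A plain sums[indices] += values would not work here, because with repeated indices only one of the duplicate contributions is kept.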