Tags: python, arrays, numpy, mask

Numpy: apply mask to values, then take mean, but in parallel


I have a 1d numpy array of values:

v = np.array([0, 1, 4, 0, 5])

Furthermore, I have a 2d numpy array of boolean masks (in production, there are millions of masks):

m = np.array([
    [True, True, False, False, False],
    [True, False, True, False, True],
    [True, True, True, True, True],
])

I want to apply each row from the mask to the array v, and then compute the mean of the masked values.

Expected behavior:

results = []
for mask in m:
    results.append(np.mean(v[mask]))

print(results) # [0.5, 3.0, 2.0]

This is easy to do sequentially, but I am sure there is an elegant way to do it in parallel. One solution that I've found:

mask = np.ones(m.shape)
mask[~m] = np.nan
np.nanmean(v * mask, axis=1) # [0.5, 3.0, 2.0]

Is there another solution, perhaps using the np.ma module? I am looking for something faster than my two current solutions.


Solution

  • Faster Numpy code

    A faster Numpy way to do that is to perform a matrix multiplication. Since True counts as 1, m @ v gives the per-row sum of the selected values, so dividing by the per-row count of True values yields the mean:

    (m @ v) / m.sum(axis=1)
    

    We can optimize this further by avoiding the implicit conversion and performing the summation with 8-bit integers (this is safe only because the number of columns of m is small -- i.e. at most 127):

    (m @ v) / m.view(np.int8).sum(dtype=np.int8, axis=1)
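
    For reference, here is a self-contained check of both variants on the arrays from the question (the expected result [0.5, 3.0, 2.0] comes from the question itself):

    import numpy as np

    v = np.array([0, 1, 4, 0, 5])
    m = np.array([
        [True, True, False, False, False],
        [True, False, True, False, True],
        [True, True, True, True, True],
    ])

    # The boolean mask acts as 0/1, so m @ v is the per-row sum of the selected values.
    print((m @ v) / m.sum(axis=1))                               # [0.5 3.  2. ]
    # Viewing the bool buffer as int8 avoids widening the mask before summing.
    print((m @ v) / m.view(np.int8).sum(dtype=np.int8, axis=1))  # [0.5 3.  2. ]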
    

    Even faster code with Numba multithreading

    We can use Numba to write a similar function and even use multiple threads:

    import numba as nb
    import numpy as np

    # Eagerly compile for a contiguous float32 array v and a C-contiguous boolean matrix m.
    @nb.njit('(float32[::1], bool_[:,::1])', parallel=True)
    def compute(v, m):
        si, sj = m.shape
        res = np.empty(si, dtype=np.float32)
        for i in nb.prange(si):      # mask rows are processed in parallel
            s = np.float32(0)
            count = 0
            for j in range(sj):
                if m[i, j]:          # accumulate only the masked-in values
                    s += v[j]
                    count += 1
            if count > 0:
                res[i] = s / count
            else:
                res[i] = np.nan      # all-False row: the mean is undefined
        return res
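
    Because of the explicit signature, the inputs must match it exactly: v has to be a contiguous float32 array and m a C-contiguous boolean matrix. As a rough usage sketch on the arrays from the question (the casts and the names v32 and mb are only for illustration):

    v32 = np.ascontiguousarray(v, dtype=np.float32)
    mb = np.ascontiguousarray(m, dtype=np.bool_)
    print(compute(v32, mb))  # [0.5 3.  2. ]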
    

    Results

    Here are results on my i5-9600KF CPU (with 6 cores):

    Initial vectorized code:   136 ms
    jakevdp's solution:         74 ms
    Numpy matmul:               28 ms
    Numba sequential code:      21 ms
    Optimized Numpy matmul:     20 ms   <-----
    Numba parallel code:         4 ms   <-----
    

    The sequential Numba implementation is unfortunately not much better than the Numpy one (Numba does not generate very good code in this case), but it scales well, so the parallel version is significantly faster.

    The Numba parallel version is 34 times faster than the initial vectorized code and 18 times faster than jakevdp's solution. The optimized Numpy matrix-multiplication-based solution is 3.7 times faster than jakevdp's solution.

    Note that the Numba code is also a bit better than jakevdp's solution and the optimized Numpy one, since it supports the case where the mean is undefined (an all-False mask row yields NaN) without printing a warning about a division by 0.
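
    For instance, with a row whose mask is entirely False (a hypothetical edge case, not part of the question's data), the matmul expression divides a zero sum by a zero count, so Numpy typically emits a RuntimeWarning and returns NaN, whereas the Numba function simply stores NaN for that row (reusing v32 and compute from the sketch above):

    m_empty = np.array([[False, False, False, False, False]])
    print((m_empty @ v) / m_empty.sum(axis=1))  # [nan], with a division RuntimeWarning
    print(compute(v32, m_empty))                # [nan], no warning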