I have an array filled with data only in its lower triangle; the rest is np.nan. I want to perform some operations on this matrix, more precisely on the data elements, not the nans, because I expect vectorized operations to be much quicker when the nan elements are skipped.
I have two test arrays:
arr = np.array([
[1.111, 2.222, 3.333, 4.444, 5.555],
[6.666, 7.777, 8.888, 9.999, 10.10],
[11.11, 12.12, 13.13, 14.14, 15.15],
[16.16, 17.17, 18.18, 19.19, 20.20],
[21.21, 22.22, 23.23, 24.24, 25.25]
])
arr_nans = np.array([
[np.nan, np.nan, np.nan, np.nan, np.nan],
[6.666, np.nan, np.nan, np.nan, np.nan],
[11.11, 12.12, np.nan, np.nan, np.nan],
[16.16, 17.17, 18.18, np.nan, np.nan],
[21.21, 22.22, 23.23, 24.24, np.nan]
])
This is how I test them:
test = timeit.timeit('arr * 5 / 2.123', globals=globals(), number=1000)
test_nans = timeit.timeit('arr_nans * 5 / 2.123', globals=globals(), number=1000)
masked_arr_nans = np.ma.array(arr_nans, mask=np.isnan(arr_nans))
test_masked_nans = timeit.timeit('masked_arr_nans * 5 / 2.123', globals=globals(), number=1000)
print(test) # 0.0017232997342944145s
print(test_nans) # 0.0017070993781089783s
print(test_masked_nans) # 0.052730199880898s
I have created a masked array masked_arr_nans with all the nans masked. But this approach is far slower than the first two, and I don't understand why.
The main question is: what is the quickest way to operate on arrays like arr_nans that contain a lot of nans? There is probably a quicker approach than the ones I mentioned.
The side question is: why does the masked array work so much slower?
I think this hypothesis is incorrect:
I expect vectorized operations to be much quicker when the nan elements are skipped
In your array the data is contiguous, which is, among other things, why vectorization is fast. Using a masked array doesn't change that: there is just as much data, and the masked portions still have to be ignored during processing, which adds the extra cost of checking which elements are masked. The skipping itself is work that the masked array still has to do.
Quite often, with vectorized operations, it is more efficient to perform extra operations and handle the data as contiguous values rather than trying to minimize the number of operations.
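To see the mask-handling overhead in isolation, here is a minimal sketch (the array size and mask fraction are arbitrary choices of mine, not taken from the question):
import numpy as np
import timeit

rng = np.random.default_rng(0)
a = rng.random((1000, 1000))             # plain contiguous data
ma = np.ma.array(a, mask=a > 0.5)        # same amount of data, roughly half of it masked

t_plain = timeit.timeit('a * 5 / 2.123', globals=globals(), number=100)
t_masked = timeit.timeit('ma * 5 / 2.123', globals=globals(), number=100)
print(t_plain, t_masked)                 # the masked version is typically much slower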
If you really need to perform several operations or complex/expensive computations on a subset of the data, I would advise creating a new array with just that data. The cost of selecting the data is paid only once, or is lower than the cost of the computations.
idx = np.tril_indices_from(arr, k=-1)
tril_arr = arr[idx]
# do several things with tril_arr
# restore a rectangular form
out = np.full_like(arr, np.nan)
out[idx] = tril_arr
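For instance, applying the question's arithmetic only to the valid values (purely illustrative; any sequence of operations could go in the middle):
idx = np.tril_indices_from(arr_nans, k=-1)
tril_arr = arr_nans[idx]             # 1-D contiguous copy of the non-nan values
tril_arr = tril_arr * 5 / 2.123      # operate on the subset only
out = np.full_like(arr_nans, np.nan)
out[idx] = tril_arr                  # put the results back into the lower triangle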
Let's take your input array and perform repeated operations on it (each operation computes arr = 1 - arr). We either apply the operation to the full array or to the flattened lower triangle. The cost of selecting the subset of the data is not worth it if we only perform a few operations; after enough intermediate operations, the two approaches become identical in speed.
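A minimal sketch of such a comparison (the array size, repetition counts and number of timing runs are my own choices, not the original benchmark):
import numpy as np
from timeit import timeit

a = np.random.rand(2000, 2000)
idx = np.tril_indices_from(a, k=-1)

def full_array(n):
    out = a.copy()
    for _ in range(n):
        out = 1 - out                # cheap op on the whole contiguous array
    return out

def subset(n):
    tril = a[idx]                    # selection cost paid once
    for _ in range(n):
        tril = 1 - tril              # same op on the lower-triangle values only
    out = np.full_like(a, np.nan)
    out[idx] = tril                  # restore the rectangular form
    return out

for n in (1, 10, 100):
    print(n, timeit(lambda: full_array(n), number=5),
             timeit(lambda: subset(n), number=5))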
Now let's use a more complex/expensive computation (arr = log(exp(arr))). Now we see two things:

[plots: timings of the full-array vs. lower-triangle-subset approaches for the arr = 1 - arr and arr = log(exp(arr)) examples]

As a rule of thumb, if the operation you want to perform on the non-masked values is cheap or not repeated, don't bother and apply it to the whole array. If the operation is complex/expensive/repeated, then consider subsetting the data.
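For the expensive computation, only the loop bodies of the sketch above change (again an illustrative sketch, not the answer's original benchmark code):
import numpy as np
from timeit import timeit

a = np.random.rand(2000, 2000)
idx = np.tril_indices_from(a, k=-1)

def full_array_exp(n):
    out = a.copy()
    for _ in range(n):
        out = np.log(np.exp(out))    # expensive per-element work on all values
    return out

def subset_exp(n):
    tril = a[idx]
    for _ in range(n):
        tril = np.log(np.exp(tril))  # same work on the lower-triangle values only
    out = np.full_like(a, np.nan)
    out[idx] = tril
    return out

for n in (1, 5, 20):
    print(n, timeit(lambda: full_array_exp(n), number=3),
             timeit(lambda: subset_exp(n), number=3))
Here the per-element cost of exp/log tends to dominate the one-off cost of extracting the lower triangle, which is why subsetting pays off much sooner than with the cheap arr = 1 - arr operation.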