I have a large array and I want to mask out certain values (set them to nodata), but I'm getting an out-of-memory error despite having sufficient RAM.
The example below reproduces my situation. My array is ~14.5 GB and the mask is ~7.3 GB, but I have 64 GB of RAM dedicated to this, so I don't understand why this fails.
import numpy as np
arr = np.zeros((1, 71829, 101321), dtype='uint16')
arr.nbytes
#14555572218
mask = np.random.randint(2, size=(71829, 101321), dtype='bool')
mask.nbytes
#7277786109
nodata = 0
#this results in OOM error
arr[:, mask] = nodata
Interestingly, if I do the following, then things work.
arr = np.zeros((71829, 101321), dtype='uint16')
arr.nbytes
#14555572218
mask = np.random.randint(2, size=(71829, 101321), dtype='bool')
mask.nbytes
#7277786109
nodata = 0
#this works
arr[mask] = nodata
But that isn't something I can use: this code will be part of a library module that needs to accept arrays with a variable size along the zeroth dimension.
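For concreteness, the code would sit in a helper roughly like the sketch below (mask_nodata is just a placeholder name; the real function does more than this), so hard-coding the 2-D arr[mask] form isn't an option:
def mask_nodata(arr, mask, nodata):
    # arr has shape (n_bands, height, width); n_bands differs between callers,
    # so the indexing has to work for any size of the zeroth dimension.
    arr[:, mask] = nodata
    return arr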
My guess is that arr[mask] = nodata modifies the array in place while arr[:, mask] = nodata creates a new array, but I don't know why that would be the case. Even if it did, there should still be enough space for that, since the total size of arr and mask is about 22 GB and I have 64 GB of RAM.
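Just to spell out that arithmetic (counting only the two arrays themselves, not any temporaries numpy might allocate internally):
arr_bytes = 1 * 71829 * 101321 * 2     # uint16 -> 2 bytes per element
mask_bytes = 71829 * 101321 * 1        # bool   -> 1 byte per element
print((arr_bytes + mask_bytes) / 1e9)  # ~21.8, i.e. roughly 22 GB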
While searching about this I found this question, but I'm new to numpy and I didn't understand the explanation in the longer answer. I did try the np.where approach from the other answer to that question, but I still get an OOM error.
Any input would be appreciated.
I suspect the issue here is that combining slice-based and mask-based indexing leads to a memory-inefficient codepath. You might try expressing it this way so that you're using entirely mask-based indexing:
arr[mask[None]] = nodata
I don't know enough about the implementation of np.ndarray.__setitem__ to guess at why the arr[:, mask] version leads to memory issues.
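As a quick sanity check on a toy-sized array (a sketch, not your real data), the all-boolean form hits the same elements as the slice-plus-mask form when the zeroth dimension is 1:
import numpy as np

rng = np.random.default_rng(0)
a1 = rng.integers(1, 100, size=(1, 4, 5), dtype='uint16')
a2 = a1.copy()
m = rng.integers(0, 2, size=(4, 5)).astype(bool)

a1[:, m] = 0       # slice + 2-D boolean mask, as in the question
a2[m[None]] = 0    # single 3-D boolean mask matching a2's shape
assert np.array_equal(a1, a2)
If the zeroth dimension can be larger than 1, the boolean index has to match the array's full shape, so something along the lines of arr[np.broadcast_to(mask, arr.shape)] = nodata would be the analogous all-boolean form; I haven't checked whether it avoids the memory blow-up at your scale.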