I have a three-dimensional DataArray and I would like to identify the indices where a 3x3 window in the last two dimensions does not contain any nans.
Consider toy data like so:
import numpy as np
import xarray as xr

np.random.seed(1234)
arr = np.random.rand(20, 10, 10)
arr[arr < 0.1] = np.nan
arr = xr.DataArray(arr, dims=["time", "x", "y"])
My approach is to make a rolling object and reduce it as below. Note that I cannot use arr.rolling(...).construct(...) because my actual data is too large.
result = arr.rolling(center=True, x=3, y=3).reduce(
    lambda x, axis: np.all(~np.isnan(x), axis=axis)
)
This seems to work, but why is the output a mixture of nans and 1s instead of nans, 1s, and 0s?
> result.isel(time=0)
array([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, 1., 1., nan, nan, nan, nan],
[nan, nan, nan, nan, 1., 1., nan, nan, nan, nan],
[nan, nan, nan, nan, 1., nan, nan, nan, 1., nan],
[nan, nan, nan, nan, nan, nan, nan, nan, 1., nan],
[nan, nan, nan, nan, nan, nan, nan, nan, 1., nan],
[nan, 1., nan, nan, nan, nan, nan, nan, nan, nan],
[nan, 1., 1., 1., nan, nan, nan, nan, nan, nan],
[nan, 1., 1., 1., 1., 1., 1., nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
I understand that the outer border is nan because a 3x3 window cannot be constructed there. But I don't understand why nans show up in the interior of the array as well. If a window contains a nan, shouldn't np.all return False, which would then be cast to 0?
I only care about the cells with 1s so this isn't a big deal. But, I'm curious to know why this is happening.
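For what it's worth, the intuition about np.all does hold on a plain NumPy array: a window containing a nan reduces to False, which casts to 0, so the masking must be coming from xarray's rolling machinery rather than from np.all itself. A minimal check:

```python
import numpy as np

# A 3x3 window with one missing value, as a plain ndarray.
window = np.array([[1.0, 2.0, np.nan],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])

# np.all over the validity mask is False, not nan.
valid = np.all(~np.isnan(window))
print(valid)         # False
print(float(valid))  # 0.0
```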
I tried a few approaches and settled on
indices = np.where(
array.rolling(...)
.mean()
.notnull()
)
This was able to handle the large array without using more than a few GB of RAM when the array is on disk, and even less when the array is backed by dask. Credit goes to ThomasMGeo on the Pangeo forum. I suspect that calling .construct() isn't actually using that much memory; a .stack() call I had in an earlier version was the real culprit.
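Run end-to-end on the toy array from the question, the approach looks like the sketch below. The window arguments (center=True, x=3, y=3) are filled in from the question's example, since the real call's arguments are elided above:

```python
import numpy as np
import xarray as xr

np.random.seed(1234)
arr = np.random.rand(20, 10, 10)
arr[arr < 0.1] = np.nan
arr = xr.DataArray(arr, dims=["time", "x", "y"])

# With the default min_periods (the full window size), the rolling mean
# is nan wherever the 3x3 window is incomplete or contains a nan, so
# .notnull() is True exactly at the centers of fully valid windows.
indices = np.where(
    arr.rolling(center=True, x=3, y=3)
    .mean()
    .notnull()
)

# indices is a tuple of three index arrays (time, x, y) of window centers.
```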