I'm currently trying to resample a large GeoTIFF file to a coarser resolution. The file contains tree-species classes (encoded as integer values) at each pixel, so I want each resampled block (3x3 or 2x2) to contain the most frequently occurring integer/tree species, ignoring the nan/fill value.
My first approach was using scipy.stats.mode, which does exactly that, but it made my RAM explode no matter whether I use raw numpy or dask (see my issue on GitHub: link).
The actual data is only about 20 % of my RAM (roughly 5.5 GB; shape (86145, 64074), dtype uint8), so it should be possible (I think). Dask also cannot run this lazily and computes immediately, which is also confusing to me (and the reason for my initial question in the issue). It just keeps allocating RAM until my RAM and swap are full and my laptop collapses.
In my head this task is actually quite simple: reduce each block to its most frequent value (like xr.coarsen or skimage.measure.block_reduce does). I'm probably a bit naive and it's not that easy, but I can't wrap my head around why this is so hard to do; it doesn't seem like a particularly heavy computational task.
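For illustration, this is roughly the operation I have in mind, shown on a tiny toy array (the values and shapes are made up, and there is no fill-value handling):

import numpy as np
from scipy import stats

# toy 6x6 raster of class labels, to be coarsened with 3x3 blocks
classes = np.random.randint(0, 5, size=(6, 6), dtype=np.uint8)

# view it as a 2x2 grid of 3x3 blocks, flatten each block, take the per-block mode
blocks = classes.reshape(2, 3, 2, 3).swapaxes(1, 2)  # shape (2, 2, 3, 3)
flat = blocks.reshape(2, 2, -1)                      # shape (2, 2, 9)
majority, _ = stats.mode(flat, axis=-1)              # most frequent class per block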
I tried many ways, from np.bincount and np.unique, to implementing something myself in Python (and with numba), to reshaping everything into 2-dimensional data, but nothing works. Is numpy/Python simply not the right tool?
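For reference, the bincount route I mean is roughly this kind of thing (a simplified sketch, not my exact code, and again without fill-value handling):

import numpy as np

def block_mode_bincount(arr, block=3):
    # crop to a multiple of the block size and view it as a grid of block x block tiles
    ny, nx = (s // block * block for s in arr.shape)
    tiles = arr[:ny, :nx].reshape(ny // block, block, nx // block, block).swapaxes(1, 2)
    flat = tiles.reshape(ny // block, nx // block, -1)
    # np.bincount has no axis argument, so it runs once per block
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), -1, flat)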
So my question is: how can I coarsen this array by taking the most frequent class per block (a majority vote) without blowing up my RAM?
Example code to reproduce:
import xarray as xr
import dask.array as da
import numpy as np
from scipy import stats
# Define dimensions
nx, ny, nt = 3000, 300, 100 # size of each dimension
chunks = (300, 30, 10) # chunk sizes for Dask
# Create Dask arrays
data1 = da.random.random((nx, ny, nt), chunks=chunks)
data2 = da.random.random((nx, ny, nt), chunks=chunks)
# Create coordinates
x = np.linspace(0, 10, nx)
y = np.linspace(0, 5, ny)
time = np.arange(nt)
# Build the xarray Dataset
ds = xr.Dataset(
    {
        "temperature": (("x", "y", "time"), data1),
        "precipitation": (("x", "y", "time"), data2),
    },
    coords={
        "x": x,
        "y": y,
        "time": time,
    }
)
# custom function for accessing the mode
def find_mode(arr, axis):
    m, _ = stats.mode(arr, axis=axis)
    return m
# this is lazy!
coarse_mean_ds = ds.coarsen(x=3, y=3, boundary='pad').reduce(np.mean)
# this computes on the spot and allocates far too much RAM in a single worker!
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(find_mode)
My first approach was using scipy.stats.mode, which does exactly that, but it made my RAM explode no matter whether I use raw numpy or dask (see my issue on GitHub: link).
Reading the suggestion made in that issue by keewis, it seems like it could be fixed by flattening the axes that one is taking the mode over.
In other words, if one has a 2 x 2 x 3 x 3 array, which represents a 2x2 grid of 3x3 patches, one can take the mode of each patch by reshaping into a 2 x 2 x 9 array, and taking the mode over the last axis.
For example:
def find_mode(arr, numaxes):
    """Compute mode of patches, where patches are specified on last `numaxes` axes"""
    # print(f"find_mode called with {arr.shape=} {arr.size=}")
    # Check that array shape is what we expect
    assert len(arr.shape) >= numaxes + 1, "must be called on array which is at least 3D"
    new_shape = arr.shape[:-numaxes]
    # Reshape, compressing last `numaxes` dimensions, then take mode over that dimension
    arr = arr.reshape(*new_shape, -1)
    m, _ = stats.mode(arr, axis=-1)
    return m
import string  # for the letters used in the gufunc signature

def mode(arr, axis):
    # from https://github.com/pydata/xarray/issues/10636#issuecomment-3198355198
    if not isinstance(axis, (tuple, list)):
        axis = tuple([axis])
    else:
        axis = tuple(axis)
    axes = [axis]
    # The signature of this gufunc is that it takes in an
    # array with the same number of dimensions as axis, and outputs
    # a scalar.
    symbols = [string.ascii_lowercase[idx] for idx, _ in enumerate(axis)]
    signature = f"({','.join(symbols)})->()"
    numaxes = len(axis)
    return da.apply_gufunc(
        lambda x: find_mode(x, numaxes),
        signature,
        arr,
        axes=axes,
        allow_rechunk=True,
        output_dtypes=[arr.dtype],
    )
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(mode)
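As a quick check that this stays lazy, you can feed the wrapper the kind of windowed array coarsen produces internally, here a made-up 2x2 grid of 3x3 windows (this assumes a recent scipy, where stats.mode drops the reduced axis by default):

import numpy as np
import dask.array as da

# toy windowed array: a 2x2 grid of 3x3 windows, with axes 1 and 3 being the window axes
windows = da.from_array(
    np.random.randint(0, 4, size=(2, 3, 2, 3), dtype=np.uint8), chunks=-1
)
lazy = mode(windows, axis=(1, 3))  # only builds the task graph
print(lazy)                        # still a lazy dask array of shape (2, 2)
print(lazy.compute())              # most frequent value in each 3x3 window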
In my testing, this solution no longer computes eagerly and uses much less memory than the solution shown in the question, but it is somewhat slower. In the table below, I also benchmarked .reduce(np.mean) as a baseline.
| Solution | Peak Memory Usage | Time Taken |
| --- | --- | --- |
| Built-in mean | 296.02 MB | 7.15 s |
| Original mode | 2644.11 MB | 1088.17 s |
| Lazy mode | 322.7 MB | 1897.70 s |
Benchmarking notes: peak memory usage was measured with resource.getrusage(resource.RUSAGE_SELF).ru_maxrss. This is a measurement of peak memory use over the lifetime of a process, made by the operating system.
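For reference, the peak-memory number can be read out like this (ru_maxrss is reported in kilobytes on Linux and in bytes on macOS):

import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak / 1024:.1f} MB")  # assuming Linux, where ru_maxrss is in KB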