I'm currently trying to resample a large GeoTIFF file to a coarser resolution. The file contains tree-species classes (encoded as integer values) at each pixel, so I want each resampled block (3x3 or 2x2) to contain the most frequently occurring integer/tree species, ignoring the nan/fill value.
My first approach was using scipy.stats.mode, which does exactly that, but it made my RAM explode no matter whether I use raw numpy or dask (see my issue on GitHub: link).
The actual data is only about 20 % of my RAM (roughly 5.5 GB; shape (86145, 64074), dtype uint8), so it should be possible (I think). Dask also cannot run this lazily and computes immediately, which is also confusing to me (and the reason for my initial question in the issue). It just keeps allocating RAM until my RAM and swap are full and my laptop collapses.
In my head this task is actually quite simple: reduce each block to its most frequent value (like xr.coarsen or skimage.measure.block_reduce does). I'm probably a bit naive and it's not that easy, but I can't wrap my head around why this is so hard to do; it doesn't seem like a particularly heavy computational task.
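For illustration, this is roughly the operation I have in mind, shown on a tiny toy array (the values and shapes are made up, and there is no fill-value handling):

import numpy as np
from scipy import stats

# toy 6x6 raster of class labels, to be coarsened with 3x3 blocks
classes = np.random.randint(0, 5, size=(6, 6), dtype=np.uint8)

# view it as a 2x2 grid of 3x3 blocks, flatten each block, take the per-block mode
blocks = classes.reshape(2, 3, 2, 3).swapaxes(1, 2)  # shape (2, 2, 3, 3)
flat = blocks.reshape(2, 2, -1)                      # shape (2, 2, 9)
majority, _ = stats.mode(flat, axis=-1)              # most frequent class per block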
I tried many ways, from np.bincount and np.unique, to implementing something myself in Python (and with numba), to reshaping everything into 2-dimensional data, but nothing works. Is numpy/Python simply not the right tool?
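For reference, the bincount route I mean is roughly this kind of thing (a simplified sketch, not my exact code, and again without fill-value handling):

import numpy as np

def block_mode_bincount(arr, block=3):
    # crop to a multiple of the block size and view it as a grid of block x block tiles
    ny, nx = (s // block * block for s in arr.shape)
    tiles = arr[:ny, :nx].reshape(ny // block, block, nx // block, block).swapaxes(1, 2)
    flat = tiles.reshape(ny // block, nx // block, -1)
    # np.bincount has no axis argument, so it runs once per block
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), -1, flat)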
So my question is: how can I coarsen this array by taking the most frequent class per block (a majority vote) without blowing up my RAM?
Example code to reproduce:
import xarray as xr
import dask.array as da
import numpy as np
from scipy import stats
# Define dimensions
nx, ny, nt = 3000, 300, 100 # size of each dimension
chunks = (300, 30, 10) # chunk sizes for Dask
# Create Dask arrays
data1 = da.random.random((nx, ny, nt), chunks=chunks)
data2 = da.random.random((nx, ny, nt), chunks=chunks)
# Create coordinates
x = np.linspace(0, 10, nx)
y = np.linspace(0, 5, ny)
time = np.arange(nt)
# Build the xarray Dataset
ds = xr.Dataset(
    {
        "temperature": (("x", "y", "time"), data1),
        "precipitation": (("x", "y", "time"), data2),
    },
    coords={
        "x": x,
        "y": y,
        "time": time,
    }
)
# custom function for accessing the mode
def find_mode(arr, axis):
    m, _ = stats.mode(arr, axis=axis)
    return m
# this is lazy!
coarse_mean_ds = ds.coarsen(x=3, y=3, boundary='pad').reduce(np.mean)
# this computes on the spot and allocates far too much RAM in a single worker!
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(find_mode)
My first approach was using scipy.stats.mode, which does exactly that, but it made my RAM explode no matter whether I use raw numpy or dask (see my issue on GitHub: link).
Reading the suggestion made in that issue by keewis, it seems like it could be fixed by flattening the axes that one is taking the mode over.
In other words, if one has a 2 x 2 x 3 x 3 array, which represents a 2x2 grid of 3x3 patches, one can take the mode of each patch by reshaping into a 2 x 2 x 9 array, and taking the mode over the last axis.
For example:
def find_mode(arr, numaxes):
    """Compute mode of patches, where patches are specified on last `numaxes` axes"""
    # print(f"find_mode called with {arr.shape=} {arr.size=}")
    # Check that array shape is what we expect
    assert len(arr.shape) >= numaxes + 1, "must be called on array which is at least 3D"
    new_shape = arr.shape[:-numaxes]
    # Reshape, compressing last `numaxes` dimensions, then take mode over that dimension
    arr = arr.reshape(*new_shape, -1)
    m, _ = stats.mode(arr, axis=-1)
    return m
import string  # for the letters used in the gufunc signature

def mode(arr, axis):
    # from https://github.com/pydata/xarray/issues/10636#issuecomment-3198355198
    if not isinstance(axis, (tuple, list)):
        axis = tuple([axis])
    else:
        axis = tuple(axis)
    axes = [axis]
    # The signature of this gufunc is that it takes in an
    # array with the same number of dimensions as axis, and outputs
    # a scalar.
    symbols = [string.ascii_lowercase[idx] for idx, _ in enumerate(axis)]
    signature = f"({','.join(symbols)})->()"
    numaxes = len(axis)
    return da.apply_gufunc(
        lambda x: find_mode(x, numaxes),
        signature,
        arr,
        axes=axes,
        allow_rechunk=True,
        output_dtypes=[arr.dtype],
    )
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(mode)
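As a quick check that this stays lazy, you can feed the wrapper the kind of windowed array coarsen produces internally, here a made-up 2x2 grid of 3x3 windows (this assumes a recent scipy, where stats.mode drops the reduced axis by default):

import numpy as np
import dask.array as da

# toy windowed array: a 2x2 grid of 3x3 windows, with axes 1 and 3 being the window axes
windows = da.from_array(
    np.random.randint(0, 4, size=(2, 3, 2, 3), dtype=np.uint8), chunks=-1
)
lazy = mode(windows, axis=(1, 3))  # only builds the task graph
print(lazy)                        # still a lazy dask array of shape (2, 2)
print(lazy.compute())              # most frequent value in each 3x3 window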
In my testing, this solution no longer computes eagerly and uses much less memory than the solution shown in the question, but it is somewhat slower. In the table below, I also benchmarked .reduce(np.mean) as a baseline.
| Solution | Peak Memory Usage | Time Taken |
| --- | --- | --- |
| Built-in mean | 296.02 MB | 7.15 s |
| Original mode | 2644.11 MB | 1088.17 s |
| Lazy mode | 322.7 MB | 1897.70 s |
Benchmarking notes: peak memory usage was measured with resource.getrusage(resource.RUSAGE_SELF).ru_maxrss. This is a measurement of peak memory use over the lifetime of a process, made by the operating system.
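For reference, the peak-memory number can be read out like this (ru_maxrss is reported in kilobytes on Linux and in bytes on macOS):

import resource

peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak / 1024:.1f} MB")  # assuming Linux, where ru_maxrss is in KB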