
NumPy median-of-means computation across unequal-sized buckets


Assume a NumPy array X of shape m x n and dtype float64. I need an element-wise (per-column) median-of-means computation over the rows of X. Specifically, the m row indices are partitioned into b "buckets", each containing m/b indices. I then compute the mean within each bucket and, finally, the median across the resulting b means.

The following example illustrates this:

import numpy as np

m = 10
n = 10000

# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)

# Number of buckets to split rows into
b = 5

# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]

# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)

# Compute the median-of-means
median = np.median(bucket_means, axis=0)

# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)

This program works fine if b divides m, since np.array_split() then partitions the indices into equal parts and buckets is a 2D array.

However, it does not work if b does not divide m. In that case, np.array_split() still splits the indices into b buckets, but of unequal sizes, which is fine for my purposes. For example, if b = 3 it will split the indices {0,1,...,9} into [0 1 2 3], [4 5 6] and [7 8 9]. Those arrays cannot be stacked on top of one another, so buckets is not a 2D array and cannot be used to index X.
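
A minimal check of the split described above, assuming only the m = 10, b = 3 case from the example (the names parts and sizes are introduced here purely for illustration):

```python
import numpy as np

# Split the 10 row indices into 3 buckets
row_indices = np.arange(10)
parts = np.array_split(row_indices, 3)

# Bucket sizes differ because 3 does not divide 10
sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]
```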

How can I make this work for unequal-sized buckets, i.e., to have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?

I cannot fully grasp masked arrays and I am not sure if those can be used here.


Solution

  • You can compute each bucket's mean separately, then stack the means and compute the median. Also, you can apply array_split to X directly; there is no need to index X with a split index array (maybe this was your main question?).

    import numpy as np

    m = 11
    n = 10000
    
    # A random data matrix
    X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
    
    # Number of buckets to split rows into
    b = 5
    
    # Partition the rows of X into b buckets
    buckets = np.array_split(X, b, axis=0)
    
    # Compute the mean within each bucket
    b_means = [np.mean(x, axis=0) for x in buckets]
    
    # Compute the median-of-means
    median = np.median(np.array(b_means), axis=0)
    
    print(median)  # array of shape (10000,)
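
If the Python-level loop over buckets ever becomes a concern, one possible alternative is to sum each contiguous bucket with np.add.reduceat and divide by the bucket sizes. The following is only a sketch under stated assumptions: X is 2D, 1 <= b <= m (so no bucket is empty), and the buckets are the same contiguous row blocks that np.array_split would produce. The function name median_of_means is hypothetical and not part of NumPy.

```python
import numpy as np

def median_of_means(X, b):
    """Median-of-means over the rows of a 2D array X, using b contiguous buckets.

    Hypothetical helper; assumes 1 <= b <= X.shape[0] so every bucket is non-empty.
    """
    m = X.shape[0]
    if not 1 <= b <= m:
        raise ValueError("b must satisfy 1 <= b <= number of rows")
    # Same bucket sizes as np.array_split: the first m % b buckets get one extra row
    sizes = np.full(b, m // b)
    sizes[: m % b] += 1
    # Index of the first row of each bucket
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    # Sum the rows within each bucket, then divide by the bucket size
    bucket_sums = np.add.reduceat(X, starts, axis=0)
    bucket_means = bucket_sums / sizes[:, None]
    # Median across the b bucket means, per column
    return np.median(bucket_means, axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(11, 1000))
result = median_of_means(X, 5)
print(result.shape)
```

For small b the list comprehension above is usually just as readable; the reduceat variant mainly avoids building a Python list of per-bucket arrays.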