I am trying to calculate a moving median for around 10,000 signals, each of which is a list of about 750 values.
An example dataframe looks like this:
import numpy as np
import pandas as pd
num_indices = 2000 # Set number of indices
# Generate lists of values (each a list of numbers from 0 to 1)
column_data = [np.random.random(750).tolist() for _ in range(num_indices)]
# Create DataFrame
df = pd.DataFrame({'values': column_data}, index=range(num_indices))
I have found this implementation that uses np.lib.stride_tricks, but it is a bit slow for my purpose. Does anyone have an idea for a faster method?
def moving_median(signal, n=150):
    # Compute rolling median for valid windows
    swindow = np.lib.stride_tricks.sliding_window_view(signal, (n,))
    b = np.nanmedian(swindow, axis=1)
    # Pad the first `n-1` positions with the global median of the signal
    b_full = np.concatenate([[np.nanmedian(signal)]*(n-1), b])
    return signal - b_full
And finally:
df.iloc[:,0].apply(lambda x: moving_median(x))
You don't mention having NaNs in your data, so I will assume you do not. In that case, I think this is the best SciPy has to offer:
import numpy as np
from scipy.ndimage import median_filter
rng = np.random.default_rng(49825498549428354)
data = rng.random(size=(2000, 750)) # 2000 signals, each of length 750
res = median_filter(data, size=(150,), axes=(-1,)) # moving window of size 150
I believe recent work has made this faster in the development version of SciPy (available as nightly wheels) than in earlier releases. I'm guessing that rather than re-sorting each window from scratch, it updates a sorted or partitioned data structure based on the incoming and outgoing values, but I haven't really looked into it.
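To connect this back to the question, here is a minimal sketch of subtracting the moving median from every signal at once. The names `signals`, `baseline` and `detrended` are mine, and `column_data` is a small stand-in for `df['values'].tolist()`. I wrote the window as `size=(1, n)`, which I would expect to behave like the `axes=(-1,)` form above without needing a SciPy version that supports `axes`. Note that this uses a centred window with "reflect" padding, whereas the function in the question uses a trailing window and pads the start with the global median, so the edge values will differ.

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)

# Stand-in for df['values'].tolist(): 20 signals of length 750
column_data = [rng.random(750).tolist() for _ in range(20)]

# One 2D array instead of a column of Python lists
signals = np.asarray(column_data)  # shape (20, 750)

n = 150
# Window of 1 across signals and n along each signal
baseline = median_filter(signals, size=(1, n))

# Same subtraction as in the question, but for all signals in one call
detrended = signals - baseline
```

If the result needs to go back into the DataFrame, something like `df['detrended'] = list(detrended)` should work, assuming the row order is unchanged.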
Note the various mode options in the documentation that control what happens at the boundary. If you are happy to get back an array that is smaller than the original, rather than one padded with the default "reflect" boundary condition, you can just use the default mode and trim the edges afterwards.
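A rough sketch of the trimming, on a small array so it can be checked against a brute-force sliding window. The slice bounds assume the default `origin=0`, where (as far as I understand it) the window for output index `i` covers `i - n//2` through `i + (n-1)//2`. One more thing to be aware of: for an even window size, `median_filter` appears to pick the element of rank `n // 2` rather than averaging the two middle values as `np.median` does, so the reference below is built the same way.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
data = rng.random(size=(4, 50))  # 4 short signals for illustration
n = 10                           # window length
length = data.shape[-1]

# Default mode ("reflect"); output has the same shape as the input
res = median_filter(data, size=(1, n))

# Keep only positions whose window lies fully inside the signal
lo = n // 2
hi = lo + length - n + 1
valid = res[:, lo:hi]            # shape (4, length - n + 1)

# Brute-force reference: rank n // 2 element of each full window
windows = sliding_window_view(data, n, axis=-1)
reference = np.sort(windows, axis=-1)[..., n // 2]
matches = np.array_equal(valid, reference)
```

With an odd window size the rank `n // 2` element is the usual median, so the distinction only matters for even sizes such as 150.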
If you do have NaNs, SciPy has a new vectorized_filter that will work with np.nanmedian, but it just uses stride_tricks.sliding_window_view under the hood, so it is unlikely to be faster than what you have.
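For reference, a NumPy-only sketch of the same sliding-window idea applied to all signals as one 2D array. The function name `nan_moving_median` and the `batch` parameter are my own, not part of any library. It returns valid windows only, and processes rows in batches because `np.nanmedian` may need to copy the windowed view. I would not expect it to be faster per window than the code in the question; it only avoids calling the function once per row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def nan_moving_median(data, n, batch=256):
    """NaN-aware moving median along the last axis (valid windows only)."""
    data = np.asarray(data, dtype=float)
    n_rows, length = data.shape
    out = np.empty((n_rows, length - n + 1))
    for start in range(0, n_rows, batch):
        chunk = data[start:start + batch]
        # View of shape (rows, length - n + 1, n); no copy is made here
        windows = sliding_window_view(chunk, n, axis=-1)
        out[start:start + batch] = np.nanmedian(windows, axis=-1)
    return out

rng = np.random.default_rng(0)
data = rng.random((6, 40))
data[2, 10] = np.nan  # a single missing sample

med = nan_moving_median(data, n=7, batch=4)
```

Each output column `j` corresponds to the window `data[:, j:j + n]`, so any alignment with the original signal (trailing or centred) has to be chosen when subtracting.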
If CuPy is an option, let me know, and I might be able to suggest something much faster. SciPy's median_filter still isn't as fast as it should be.