python arrays numpy signal-processing windowing

Fast conditional overlapping windowing (framing) of numpy array

I have a huge list of numpy arrays (1 dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on its label. The labels I have is 0, 1, and 2. Each window has a fixed size M.

The label of each window will be the biggest label available in the window. So if a window consists of both 0 an 1 labeled datapoints, the label will be 1 for the whole window.

But the problem is that, the windowing is not label agnostic. Because of class imbalance, I want to only do overlapped windowing in case of labels 1 and 2.

So far I have written this code:

# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
  end_cursor = start_cursor + window_size
  data.append(
    {
      "frame": arr[start_cursor:end_cursor],
      "label": y[start_cursor:end_cursor].max(),
    }
  )
  start_cursor = end_cursor
  if np.any(y[start_cursor, end_cursor] != 0):
    start_cursor = start_cursor - overlap_size

But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.

EDIT: to explain the problem more. Imagine you are going to window a signal with fixed length M. If there only exists 0 label points in the window, there will be no overlap between adjacent windows. But if there exists labels 1 and 2, there will be an overlap between two signals with percentage p%.

Solution

I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%).

To examine how to do this, start with a sample time series:

import matplotlib.pylab as plt
import numpy as np

N = 5000 # time series length

# create some sort of data set to work with
x = np.zeros(N)
# add a few 1s and 2s to the list (though really they are the same for the windowing)
y = np.random.random(N)
x[y < 0.01] = 1
x[y < 0.005] = 2

# assign a window length
M = 50 # window length
overlap = 0.5 # assume 50% overlap
M_overlap = int(M * (1-overlap))

My approach is to sum the window of interest over your time series. If the sum ==0, there is no overlap between windows and if it is >0 then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.

Summation (slow version)

def window_sum1():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
Niter = 10**4
print(timeit.timeit(window_sum1, number = Niter))
# 29.201083058

So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check) can be streamlined in an if statement.

Summation (fast version, 33% faster than slow version)

def window_sum2():
    # start of windows in list windows
    windows = [0,]
    while windows[-1] + M < N:
        check = sum(x[windows[-1]:windows[-1]+M]) == 0
        if check:
            windows.append(windows[-1] + M)
        else:
            windows.append(windows[-1] + M_overlap)
        if windows[-1] + M > N:
            windows.pop()
            break
    # plotting stuff for checking
    return(windows)
print(timeit.timeit(window_sum2, number = Niter))
# 20.456240447000003

We see a 1/3 reduction in time with the if statement.

Convolution (85% faster than fast summation)

We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve. (Disclaimer: I got the idea from the accepted answer to this question.) Of course, it also makes sense to adopt the faster window size assessment from above.

def window_conv():
    a = np.convolve(x,np.ones(M,dtype=int),'valid')
    windows = [0,]
    while windows[-1] + M < N:
        if a[windows[-1]]:
            windows.append(windows[-1] + M_overlap)
        else:
            windows.append(windows[-1] + M)
        if windows[-1] + M > N:
            windows.pop()
            break
    return(windows)
print(timeit.timeit(window_conv, number = Niter))
#3.3695770570000008

Sliding window

The last thing I will add is that, as shown in one of the comments of this question, as of numpy 1.20 there is a function called sliding_window_view. I still have numpy 1.19 running and was not able to test it to see if it's faster than convolution.