I have a huge list of numpy arrays (1 dimensional), which are time series for different events. Each point has a label, and I want to window the numpy arrays based on its label. The labels I have is 0, 1, and 2. Each window has a fixed size M.
The label of each window will be the biggest label available in the window. So if a window consists of both 0 an 1 labeled datapoints, the label will be 1 for the whole window.
But the problem is that, the windowing is not label agnostic. Because of class imbalance, I want to only do overlapped windowing in case of labels 1 and 2.
So far I have written this code:
# conditional framing
data = []
start_cursor = 0
while start_cursor < arr.size:
end_cursor = start_cursor + window_size
data.append(
{
"frame": arr[start_cursor:end_cursor],
"label": y[start_cursor:end_cursor].max(),
}
)
start_cursor = end_cursor
if np.any(y[start_cursor, end_cursor] != 0):
start_cursor = start_cursor - overlap_size
But this is clearly too verbose and just plain inefficient, especially because I will call this while loop on my huge list of separate arrays.
EDIT: to explain the problem more. Imagine you are going to window a signal with fixed length M. If there only exists 0 label points in the window, there will be no overlap between adjacent windows. But if there exists labels 1 and 2, there will be an overlap between two signals with percentage p%.
I think this does what you are asking to do. The visualization for checking isn't great, but it helps you see how the windowing works. Hopefully I understood your question right and this is what you are trying to do. Anytime there is a 1 or 2 in the time series (rather than a 0) the window steps forward some fraction of the full window length (here 50%).
To examine how to do this, start with a sample time series:
import matplotlib.pylab as plt
import numpy as np
N = 5000 # time series length
# create some sort of data set to work with
x = np.zeros(N)
# add a few 1s and 2s to the list (though really they are the same for the windowing)
y = np.random.random(N)
x[y < 0.01] = 1
x[y < 0.005] = 2
# assign a window length
M = 50 # window length
overlap = 0.5 # assume 50% overlap
M_overlap = int(M * (1-overlap))
My approach is to sum the window of interest over your time series. If the sum ==0
, there is no overlap between windows and if it is >0
then there is overlap. The question, then, becomes how should we calculate these sums efficiently? I compare two approaches. The first is simply to walk through the time series and the second is to use convolution (which is much faster). For the first one, I also explore different ways of assessing window size after summation.
Summation (slow version)
def window_sum1():
# start of windows in list windows
windows = [0,]
while windows[-1] + M < N:
check = sum(x[windows[-1]:windows[-1]+M]) == 0
windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
if windows[-1] + M > N:
windows.pop()
break
# plotting stuff for checking
return(windows)
Niter = 10**4
print(timeit.timeit(window_sum1, number = Niter))
# 29.201083058
So this approach went through 10,000 time series of length 5000 in about 30 seconds. But the line windows.append(windows[-1] + M_overlap + (M - M_overlap) * check)
can be streamlined in an if statement.
Summation (fast version, 33% faster than slow version)
def window_sum2():
# start of windows in list windows
windows = [0,]
while windows[-1] + M < N:
check = sum(x[windows[-1]:windows[-1]+M]) == 0
if check:
windows.append(windows[-1] + M)
else:
windows.append(windows[-1] + M_overlap)
if windows[-1] + M > N:
windows.pop()
break
# plotting stuff for checking
return(windows)
print(timeit.timeit(window_sum2, number = Niter))
# 20.456240447000003
We see a 1/3 reduction in time with the if statement.
Convolution (85% faster than fast summation)
We can use signal processing to get a lot faster, by convolving the time series with the window of interest using numpy.convolve. (Disclaimer: I got the idea from the accepted answer to this question.) Of course, it also makes sense to adopt the faster window size assessment from above.
def window_conv():
a = np.convolve(x,np.ones(M,dtype=int),'valid')
windows = [0,]
while windows[-1] + M < N:
if a[windows[-1]]:
windows.append(windows[-1] + M_overlap)
else:
windows.append(windows[-1] + M)
if windows[-1] + M > N:
windows.pop()
break
return(windows)
print(timeit.timeit(window_conv, number = Niter))
#3.3695770570000008
Sliding window
The last thing I will add is that, as shown in one of the comments of this question, as of numpy 1.20
there is a function called sliding_window_view. I still have numpy 1.19
running and was not able to test it to see if it's faster than convolution.