Say that I have a NumPy array:
a = np.array([0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 13, 13, 14, 15])
And I have a length m = 2
that the user specifies in order to see if there are any repeats of that length within the time series. In this case, the repeats of length m = 2
are:
[2, 2]
[5, 5]
[9, 9]
[9, 9]
[13, 13]
And the user can change this to m = 3
and the repeats of length m = 3
are:
[9, 9, 9]
[13, 13, 13]
I need a function that either returns the index of where a repeat is found or None
. So, for m = 3
the function would return the following NumPy array of starting indices:
[11, 17]
And for m = 4
the function would return None
. What's the cleanest and fastest way to accomplish this?
Update
Note that the array does not have to be sorted and we are not interested in the result after a sort. We only want the result from the unsorted array. Your result for m = 2
should be the same for this array:
b = np.array([0, 11, 2, 2, 3, 40, 5, 5, 16, 7, 80, 9, 9, 9, 1, 11, 12, 13, 13, 13, 4, 5])
Approach #1
We could leverage 1D convolution
for a vectorized solution -
def consec_repeat_starts(a, n):
N = n-1
m = a[:-1]==a[1:]
return np.flatnonzero(np.convolve(m,np.ones(N, dtype=int))==N)-N+1
Sample runs -
In [286]: a
Out[286]:
array([ 0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12,
13, 13, 13, 14, 15])
In [287]: consec_repeat_starts(a, 2)
Out[287]: array([ 2, 6, 11, 12, 17, 18])
In [288]: consec_repeat_starts(a, 3)
Out[288]: array([11, 17])
In [289]: consec_repeat_starts(a, 4)
Out[289]: array([], dtype=int64)
Approach #2
We could also make use of binary-erosion
-
from scipy.ndimage.morphology import binary_erosion
def consec_repeat_starts_v2(a, n):
N = n-1
m = a[:-1]==a[1:]
return np.flatnonzero(binary_erosion(m,[1]*N))-(N//2)