pythonarraysnumpyboolean-indexing

Python/Numpy/Boolean Indexing: Modify boolean array at locations with N consecutive True values


Given a numpy boolean array

arr = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1])

I would like to indicate where there are at least n consecutive true values (from left to right).

For n = 2:

#         True 2x (or more) in a row
#            /  \        /     \
arr = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]

#                 becomes:

res = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
#               ^-----------^--^-------- A pattern of 2 or more consecutive True's ends at each of these locations

For n = 3:

#          True 3x (or more) in a row
#                        /     \
arr = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]

#                 becomes:

res = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
#                              ^-------- A pattern of 3 or more consecutive True's ends at this location

Is there a pythonic approach to this without using a for loop to iterate over each element? Performance is important here as my application will perform this operation 1000's of times on boolean arrays containing 1000's of elements.

Notes worth mentioning:

  1. n can be any value above 2
  2. n-consecutive patterns can appear anywhere in the array; beginning, middle or end.
  3. The shape of the result array must match that of the original array.

Any help would be very much appreciated.

Benchmarks on answers

fully vectorized by alain-t:
10000 loops, best of 5: 0.138 seconds per loop, worse of 5: 0.149 seconds per loop

pad/shift by mozway:
10000 loops, best of 5: 1.62 seconds per loop, worse of 5: 1.71 seconds per loop

sliding_window_view by kubatucka (with padding by mozway):
10000 loops, best of 5: 1.15 seconds per loop, worse of 5: 1.54 seconds per loop

Solution

  • You can multiply each element with its predecessor. This will give you 1s for sequences of two or more. Do it again on the result to get sequences of 3 or more:

    import numpy as np
    arr = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1])
    
    arr[1:] *= arr[:-1]   # two consecutive
    arr[0]  = 0           # 1st '1' isn't a two-consec.
    arr[1:] *= arr[:-1]   # three consecutive 
    
    print(arr)
    [0 0 0 0 0 0 0 0 1 0 0]
    

    You may also try it this way (which is a bit faster):

    import numpy as np
    arr = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1])
    
    arr[2:] *= arr[:-2] * arr[1:-1]
    arr[:2] = 0
    
    print(arr)
    [0 0 0 0 0 0 0 0 1 0 0]
    

    A generalized solution to n-consecutives could not use the second approach but can be performed in a loop using the 1st one:

    n = 3
    arr[1:]  *= arr[:-1]
    arr[:n-1] = 0
    for i in range(n-2):
        arr[1:] *= arr[:-1]
    

    Note that this loses some of the benefits of vectorization.

    For a fully vectorized n-consecutive approach, you could select the matching differences between the running maximum of 0 positions with all item positions. The 1 positions of the result will contain the number of consecutive 1s at that position (and the 0 positions will be zero)

    n = 3
    i = np.arange(arr.size)+1
    mask = n <= i-np.maximum.accumulate((1-arr)*i) #True/False array
    
    print(mask*1)
    [0 0 0 0 0 0 0 0 1 0 0]
    

    Visually:

    arr                  [ 1  0  1  1  0  0  1  1  1  0   1  ]
    i                    [ 1  2  3  4  5  6  7  8  9  10  11 ] -+
    (1-arr)*i            [ 0  2  0  0  5  6  0  0  0  10  0  ]  |-> differences
    maximum.accumulate   [ 0  2  2  2  5  6  6  6  6  10  10 ] -+     |
    i-np.maximum...      [ 1  0  1  2  0  0  1  2  3  0   1  ] <------+
    

    As a side benefit from this approach, you actually get all the n-consecutive values in one operation from i-np.maximum.accumulate((1-arr)*i) so you can store them and check for different values of n without redoing the calculations.