pythonnumpy

Numpy: vectorizing the comparison of 2 boolean arrays to derive start and end indices of regions of interest


Assuming I have 2 boolean arrays of same size, is_overlap and is_incomplete. I would like to retrieve the start and end indices (end indices excluded) of regions of interest in these arrays (these regions are of interest for an analysis achieved in subsequent steps of the algo - start and end indices excluded will be used to slice a 3rd array of same size).

To identify these regions, this set of conditions need to be fulfilled.

Example

import numpy as np

max_n_rows    = 4
# row indices             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
is_overlap    = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=bool)

Result expected is then,

# region indices                         0  1   2   3   4
region_of_interest_starts    = np.array([1, 3,  9, 14, 16])
region_of_interest_ends_excl = np.array([2, 8, 13, 15, 17])

Explanations:


Solution

  • Here is my attempt as an answer. I am frankly not sure this is optimal, and the list of intervals returned at the end is unordered.

    import numpy as np
    
    
    def get_indices_region(mask: np.ndarray) -> np.ndarray:
        """Computes the start and end indices of each connected components in `mask`.
        Taken from https://stackoverflow.com/questions/68514880/finding-contiguous-regions-in-a-1d-boolean-array.
    
        Args:
            mask (np.ndarray): A 1d numpy array of dtype `bool`.
    
        Returns:
            A numpy array containing the list of the start and end indices of each connected components in `mask`.
        """
    
        return np.flatnonzero(np.diff(np.r_[np.int8(0), mask.astype(np.int8), np.int8(0)])).reshape(-1, 2)
    
    
    def your_function(is_overlap: np.ndarray, is_incomplete: np.ndarray, max_n_rows: int = 4) -> np.ndarray:
        # Step 1: compute the enlarged candidate regions
        indices_enlarged = get_indices_region(is_overlap | is_incomplete)
    
        # Step 2: Filter out enlarged candidates based on their length
        lengths_enlarged = indices_enlarged[:, 1] - indices_enlarged[:, 0]
        indices_enlarged = indices_enlarged[lengths_enlarged >= max_n_rows]
    
        # Step 3: compute the reduced candidate regions
        indices_reduced = get_indices_region(is_overlap)
    
        # Step 4: Filter out reduced candidates based on their overlap with the enlarged regions
        mask_overlap = np.any(
            (indices_enlarged[:, None, 0] <= indices_reduced[None, :, 0])
            & (indices_enlarged[:, None, 1] >= indices_reduced[None, :, 1]),
            axis=0
        )
        indices_reduced = indices_reduced[~mask_overlap]
    
        return np.vstack((indices_reduced, indices_enlarged))
    

    For your example, we get:

    max_n_rows = 4
    is_overlap = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1], dtype=bool)
    is_incomplete = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
    
    your_function(is_overlap, is_incomplete, max_n_rows)
    
    >>> [[1   2]
         [14 15]
         [16 17]
         [3   8]
         [9  13]]