pythonnumpyscikit-imagebinary-image

Efficient way to find coordinates of connected blobs in binary image


I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).

The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However I want a list of the coordinates of the blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image. But it is very slow. Far slower than the inital labelling.

Minimal Reproducible example:

import timeit
from skimage import measure
import numpy as np

binary_image = np.array([
        [0,1,0,0,1,1,0,1,1,0,0,1],
        [0,1,0,1,1,1,0,1,1,1,0,1],
        [0,0,0,0,0,0,0,1,1,1,0,0],
        [0,1,1,1,1,0,0,0,0,1,0,0],
        [0,0,0,0,0,0,0,1,1,1,0,0],
        [0,0,1,0,0,0,0,0,0,0,0,0],
        [0,1,0,0,1,1,0,1,1,0,0,1],
        [0,0,0,0,0,0,0,1,1,1,0,0],
        [0,1,1,1,1,0,0,0,0,1,0,0],
        ])

print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)

labels = measure.label(binary_image)

print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)

def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # Of each distinct blob.

    blobs = []

    label = 1
    while True:
        indices_of_label = np.where(labelled_array==label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob =list(zip(*indices_of_label))
            label+=1
            blobs.append(blob)


if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)', 
            globals=globals(),
            number=1
            )
        )
    print("\n\n")

Output:

2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
 [0 1 0 1 1 1 0 1 1 1 0 1]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 1 1 1 1 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 1 1 0 1 1 0 0 1]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 1 1 1 1 0 0 0 0 1 0 0]]


2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0  1  0  0  2  2  0  3  3  0  0  4]
 [ 0  1  0  2  2  2  0  3  3  3  0  4]
 [ 0  0  0  0  0  0  0  3  3  3  0  0]
 [ 0  5  5  5  5  0  0  0  0  3  0  0]
 [ 0  0  0  0  0  0  0  3  3  3  0  0]
 [ 0  0  6  0  0  0  0  0  0  0  0  0]
 [ 0  6  0  0  7  7  0  8  8  0  0  9]
 [ 0  0  0  0  0  0  0  8  8  8  0  0]
 [ 0 10 10 10 10  0  0  0  0  8  0  0]]


Beginning extract_blobs_from_labelled_array timing

Time taken:
9.346099977847189e-05

9e-05 is small but so is this image for the example. In reality I am working with very high resolution images for which the function takes approximately 10 minutes.

Is there a faster way to do this?

Side note: I'm only using list(zip()) to try get the numpy coordinates into something I'm used to (I don't use numpy much just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?


Solution

  • The part of the code that slow is here:

        while True:
            indices_of_label = np.where(labelled_array==label)
            if not indices_of_label[0].size > 0:
                break
            else:
                blob =list(zip(*indices_of_label))
                label+=1
                blobs.append(blob)
    

    First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.

    Instead, you should use:

        for label in range(np.max(labels)):
    

    and then you can ignore the if ...: break.

    A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will get you a 2D array of shape (n_coords, n_dim), ie (n_coords, 2).

    But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.

    The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:

    import timeit
    from skimage import measure
    import numpy as np
    
    binary_image = np.array([
            [0,1,0,0,1,1,0,1,1,0,0,1],
            [0,1,0,1,1,1,0,1,1,1,0,1],
            [0,0,0,0,0,0,0,1,1,1,0,0],
            [0,1,1,1,1,0,0,0,0,1,0,0],
            [0,0,0,0,0,0,0,1,1,1,0,0],
            [0,0,1,0,0,0,0,0,0,0,0,0],
            [0,1,0,0,1,1,0,1,1,0,0,1],
            [0,0,0,0,0,0,0,1,1,1,0,0],
            [0,1,1,1,1,0,0,0,0,1,0,0],
            ])
    
    print(f"\n\n2d array of type: {type(binary_image)}:")
    print(binary_image)
    
    labels = measure.label(binary_image)
    
    print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
    print(labels)
    
    def extract_blobs_from_labelled_array(labelled_array):
        """Return a list containing coordinates of pixels in each blob."""
        props = measure.regionprops(labelled_array)
        blobs = [p.coords for p in props]
        return blobs
    
    
    if __name__ == "__main__":
        print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
        print("Time taken:")
        print(
            timeit.timeit(
                'extract_blobs_from_labelled_array(labels)', 
                globals=globals(),
                number=1
                )
            )
        print("\n\n")