python, numpy, performance, optimization, vectorization

How to efficiently compute and process 3x3x3 voxel neighborhoods in a 3D NumPy array?


I am working on a function to process 3D images voxel-by-voxel. For each voxel, I compute the difference between the voxel value and its 3x3x3 neighborhood, apply distance-based scaling, and determine a label based on the maximum scaled difference. However, my current implementation is slow, and I'm looking for ways to optimize it.

import numpy as np
from tqdm import tqdm
import itertools

def create_label(image_input): 
    # Define neighborhood offsets and distances 
    locations = np.array([tuple([dx, dy, dz]) for dx, dy, dz in itertools.product(range(-1, 2), repeat=3)]) 
    euclidian_distances = np.linalg.norm(locations, axis=1) 
    euclidian_distances_inverse = np.divide(1, euclidian_distances, out=np.zeros_like(euclidian_distances), where=euclidian_distances != 0)
    
    image_input = image_input.astype(np.float32)
    label = np.zeros_like(image_input)
    image_shape = image_input.shape
    
    # Process each voxel (excluding the edges)
    for i in tqdm(range(1, image_shape[0] - 1)):
        for j in range(1, image_shape[1] - 1):
            for k in range(1, image_shape[2] - 1):
                # Extract 3x3x3 neighborhood values
                all_values_neighborhood = [image_input[loc[0] + i, loc[1] + j, loc[2] + k] for loc in locations]
    
                # Compute differences and scale by distances
                centre_minus_rest = image_input[i, j, k, None] - np.array(all_values_neighborhood)
                if np.all(centre_minus_rest < 0):  
                    label[i, j, k] = 13
                else:
                    centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse
                    centre_minus_rest_divided[13] = -100  # Ignore the center value
                    class_label = np.argmax(centre_minus_rest_divided)
                    label[i, j, k] = class_label

    return label

# example 3d array
image_input = np.random.rand(10, 10, 10).astype(np.float32)
# run the function
label = create_label(image_input) 
print(label)

Solution

  • You could use a sliding_window_view, and vectorize the for loops.

    One attempt

    def create_label2(image_input):
        # Define neighborhood offsets and distances 
        locations = np.array([tuple([dx, dy, dz]) for dx, dy, dz in itertools.product(range(-1, 2), repeat=3)])
        euclidian_distances = np.linalg.norm(locations, axis=1)
        euclidian_distances_inverse = np.divide(1, euclidian_distances, out=np.zeros_like(euclidian_distances), where=euclidian_distances != 0)
    
        image_input = image_input.astype(np.float32)
        label = np.zeros_like(image_input)
        image_shape = image_input.shape
    
        neighborhood = np.lib.stride_tricks.sliding_window_view(image_input, (3,3,3))
    
        # Process each voxel (excluding the edges)
        centre_minus_rest = image_input[1:-1, 1:-1, 1:-1, None,None,None] - neighborhood
        centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse.reshape(1,1,1,3,3,3)
        centre_minus_rest_divided[:,:,:,1,1,1] = -100
        class_label = np.argmax(centre_minus_rest_divided.reshape(image_shape[0]-2, image_shape[1]-2, image_shape[2]-2,27), axis=3)
        label[1:-1,1:-1,1:-1] = np.where((centre_minus_rest<0).all(axis=(3,4,5)), 13, class_label)
    
        return label
    

    On my machine, that is a ×70 gain

    What it does is build a (P-2)×(H-2)×(W-2)×3×3×3 array of all the 3x3x3 cubes around each (z, y, x) point, for z in 1..P-2, y in 1..H-2, x in 1..W-2

    From there, it is possible to compute, in a few vectorized operations, what you did in the for loops

    I said "attempt", which may not sound very confident, because this approach is quite memory-consuming. Not sliding_window_view itself: that costs nothing, because it is just a view, with no data of its own in memory (its memory is that of image_input). So neighborhood[12,13,14,1,1,1] is the same as neighborhood[13,14,15,0,0,0]. Not just the same value: the same data, at the same place in memory. If you set image_input[13,14,15] = 999 and then print(neighborhood[12,13,14,1,1,1]), print(neighborhood[13,14,15,0,0,0]) or print(neighborhood[13,12,13,0,2,2]), each prints your 999 back. (Writing through neighborhood itself would require passing writeable=True to sliding_window_view, since the view is read-only by default.)
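    To see that aliasing concretely, here is a small self-contained sketch (on a hypothetical 5x5x5 array, not your data):

```python
import numpy as np

# Hypothetical 5x5x5 array, just to demonstrate the aliasing.
image = np.arange(125, dtype=np.float32).reshape(5, 5, 5)
windows = np.lib.stride_tricks.sliding_window_view(image, (3, 3, 3))

# The windows share memory with the original array: no copy was made.
assert np.shares_memory(windows, image)

# Writing through the original array is visible in every window
# that covers that voxel: both indexings below alias image[1, 1, 1].
image[1, 1, 1] = 999.0
print(windows[0, 0, 0, 1, 1, 1])  # 999.0
print(windows[1, 1, 1, 0, 0, 0])  # 999.0
```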

    So neighborhood costs nothing, and its shape (which amounts to roughly 27 times the volume of the initial image) is "virtual". But still: even though it uses no actual memory, it is an array of almost 27 times more values than the image (each value repeated up to 27 times; the "almost" is because of the edges). So as soon as you do some computation on it, you get a real array of that size, with every one of those values occupying its own place in memory.

    In other words, if you start from a 100x100x100 image, which uses 1,000,000 memory places (4,000,000 bytes, since those are np.float32), neighborhood costs nothing: the same 4,000,000 bytes back its almost 27,000,000 values. But centre_minus_rest, and the subsequent intermediate arrays, each occupy "almost" 108,000,000 bytes of real memory.
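    You can check that accounting directly with .nbytes. A sketch for the 100x100x100 case (the exact figure is 98×98×98×27×4 bytes, slightly under the rounded 108 MB):

```python
import numpy as np

# 100x100x100 float32 image: 1,000,000 values, 4,000,000 bytes.
image = np.zeros((100, 100, 100), dtype=np.float32)
windows = np.lib.stride_tricks.sliding_window_view(image, (3, 3, 3))

print(image.nbytes)  # 4000000
# The view allocates nothing new, despite "containing" ~27x more values:
assert np.shares_memory(windows, image)

# Any arithmetic on the view materializes a full-size array:
diff = image[1:-1, 1:-1, 1:-1, None, None, None] - windows
print(diff.nbytes)  # 101648736, i.e. ~102 MB of real memory
```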

    So, depending on the size of your image (and the memory available on your machine), we may have traded a CPU-time problem for a memory problem.

    Compromises are possible, though.

    You could keep your first loop (for i in range...) and vectorize the computation over all j and k. The inner loops are the ones that cost the most and need to be vectorized; not vectorizing the outer loop is not that serious. In my 100x100x100 example, that means 100 non-vectorized iterations, each doing roughly 10,000 voxels' worth of vectorized work, instead of one fully vectorized pass over the whole ~1,000,000 voxels. That is not a huge regression.

    for i in range(1, image_shape[0]-1):
            centre_minus_rest = image_input[i, 1:-1, 1:-1, None,None,None] - neighborhood[i-1]
            centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse.reshape(1,1,3,3,3)
            centre_minus_rest_divided[:,:,1,1,1] = -100
            class_label = np.argmax(centre_minus_rest_divided.reshape(image_shape[1]-2, image_shape[2]-2,27), axis=2)
            label[i,1:-1,1:-1] = np.where((centre_minus_rest<0).all(axis=(2,3,4)), 13, class_label)
    

    That is just as fast on my machine: still the same ×70 gain, with no measurable decrease. And yet the intermediate values (centre_minus_rest, ...) are much smaller in memory, since they cover one i-slice at a time instead of the whole volume (roughly 100 times smaller for a 100x100x100 image).

    Note that this is still not ideal code. I should pick one shape once and for all: either 3x3x3 subarrays everywhere (values, euclidean distances, ...), or 1D arrays of length 27. (But sliding_window_view needs a 3x3x3 window, otherwise it can't do its magic: reshaping into (P-2, H-2, W-2, 27) would necessitate a full copy. And, conversely, argmax can't work along more than one axis, so it needs the length-27 form.)
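    The reshape-forces-a-copy point can be verified with np.shares_memory (a small sketch on a dummy array):

```python
import numpy as np

image = np.zeros((10, 10, 10), dtype=np.float32)
win = np.lib.stride_tricks.sliding_window_view(image, (3, 3, 3))

# The overlapping windows cannot be flattened to (..., 27) as a view,
# so reshape silently materializes a full copy.
flat = win.reshape(8, 8, 8, 27)
print(np.shares_memory(win, image))   # True: the view is free
print(np.shares_memory(flat, image))  # False: reshape had to copy
```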

    Also, the itertools trick is very pythonic, but not at all numpythonic. Something like np.sqrt(np.arange(-1, 2)[:, None, None]**2 + np.arange(-1, 2)[None, :, None]**2 + np.arange(-1, 2)[None, None, :]**2) would be more idiomatic (note the offsets run from -1 to 1, hence np.arange(-1, 2)). But that happens outside the costly loops anyway.
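    Spelled out, the numpythonic version could look like this. A sketch: it reproduces the 3x3x3 inverse-distance table that the itertools loop builds:

```python
import numpy as np

# Offsets are -1, 0, 1 along each axis.
d = np.arange(-1, 2)
# Broadcast the squared offsets into a 3x3x3 table of euclidean distances.
dist = np.sqrt(d[:, None, None]**2 + d[None, :, None]**2 + d[None, None, :]**2)
# Invert, leaving 0 at the centre where the distance is 0.
inv = np.divide(1.0, dist, out=np.zeros_like(dist), where=dist != 0)
print(inv[1, 1, 1])  # 0.0 (the centre voxel)
print(inv[0, 1, 1])  # 1.0 (a face neighbour, at distance 1)
```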

    It must also be noted that one inconvenience of vectorization is that it sometimes forces you to compute more than you need (but if you compute twice too much, 100 times faster, that is still a ×50 time gain).

    Here, my np.where((centre_minus_rest<0).all(axis=(2,3,4)), 13, class_label) means that for each voxel whose label ends up being 13, class_label was computed in vain, which your nested for loops, with their if, avoided.
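    If that wasted argmax ever mattered, one compromise is to compute it only where needed, via boolean indexing. A sketch on dummy data (the names mirror the code above, but the array here is random, standing in for the per-voxel scaled differences, not your image):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the scaled differences, flattened to length 27 per voxel.
centre_minus_rest_divided = rng.standard_normal((8, 8, 27))
all_negative = (centre_minus_rest_divided < 0).all(axis=2)

# Start with label 13 everywhere, then argmax only where it is needed.
labels = np.full(all_negative.shape, 13, dtype=np.int64)
labels[~all_negative] = np.argmax(centre_minus_rest_divided[~all_negative], axis=1)
```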

    So take it as a first lead on how to do it, rather than as a definitive solution.

    But, still, ×70 is a good start.