I am working on a function to process 3D images voxel-by-voxel. For each voxel, I compute the difference between the voxel value and its 3x3x3 neighborhood, apply distance-based scaling, and determine a label based on the maximum scaled difference. However, my current implementation is slow, and I'm looking for ways to optimize it.
import numpy as np
from tqdm import tqdm
import itertools
def create_label(image_input):
    # Define neighborhood offsets and distances
    locations = np.array([tuple([dx, dy, dz]) for dx, dy, dz in itertools.product(range(-1, 2), repeat=3)])
    euclidian_distances = np.linalg.norm(locations, axis=1)
    euclidian_distances_inverse = np.divide(1, euclidian_distances, out=np.zeros_like(euclidian_distances), where=euclidian_distances != 0)
    image_input = image_input.astype(np.float32)
    label = np.zeros_like(image_input)
    image_shape = image_input.shape
    # Process each voxel (excluding the edges)
    for i in tqdm(range(1, image_shape[0] - 1)):
        for j in range(1, image_shape[1] - 1):
            for k in range(1, image_shape[2] - 1):
                # Extract 3x3x3 neighborhood values
                all_values_neighborhood = [image_input[loc[0] + i, loc[1] + j, loc[2] + k] for loc in locations]
                # Compute differences and scale by inverse distances
                centre_minus_rest = image_input[i, j, k, None] - np.array(all_values_neighborhood)
                if np.all(centre_minus_rest < 0):
                    label[i, j, k] = 13
                else:
                    centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse
                    centre_minus_rest_divided[13] = -100  # Ignore the center value
                    class_label = np.argmax(centre_minus_rest_divided)
                    label[i, j, k] = class_label
    return label
# example 3d array
image_input = np.random.rand(10, 10, 10).astype(np.float32)
# run the function
label = create_label(image_input)
print(label)
You could use a sliding_window_view, and vectorize the for loops.

One attempt:
def create_label2(image_input):
    # Define neighborhood offsets and distances
    locations = np.array([tuple([dx, dy, dz]) for dx, dy, dz in itertools.product(range(-1, 2), repeat=3)])
    euclidian_distances = np.linalg.norm(locations, axis=1)
    euclidian_distances_inverse = np.divide(1, euclidian_distances, out=np.zeros_like(euclidian_distances), where=euclidian_distances != 0)
    image_input = image_input.astype(np.float32)
    label = np.zeros_like(image_input)
    image_shape = image_input.shape
    # View of all 3x3x3 neighborhoods (no copy)
    neighborhood = np.lib.stride_tricks.sliding_window_view(image_input, (3, 3, 3))
    # Process all interior voxels at once (excluding the edges)
    centre_minus_rest = image_input[1:-1, 1:-1, 1:-1, None, None, None] - neighborhood
    centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse.reshape(1, 1, 1, 3, 3, 3)
    centre_minus_rest_divided[:, :, :, 1, 1, 1] = -100  # Ignore the center value
    class_label = np.argmax(centre_minus_rest_divided.reshape(image_shape[0]-2, image_shape[1]-2, image_shape[2]-2, 27), axis=3)
    label[1:-1, 1:-1, 1:-1] = np.where((centre_minus_rest < 0).all(axis=(3, 4, 5)), 13, class_label)
    return label
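As a sanity check, here is a self-contained sketch (with the same locations/inverse-distance setup rebuilt locally, on a tiny random volume) comparing the vectorized labelling against a direct per-voxel computation for one interior voxel:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((6, 6, 6)).astype(np.float32)

# Neighborhood offsets and inverse distances, as in the question.
locations = np.array(list(itertools.product(range(-1, 2), repeat=3)))
dist = np.linalg.norm(locations, axis=1)
inv = np.divide(1, dist, out=np.zeros_like(dist), where=dist != 0)

# Vectorized labelling (same logic as create_label2), interior voxels only.
neigh = np.lib.stride_tricks.sliding_window_view(img, (3, 3, 3))
diff = img[1:-1, 1:-1, 1:-1, None, None, None] - neigh
scaled = diff * inv.reshape(1, 1, 1, 3, 3, 3)
scaled[:, :, :, 1, 1, 1] = -100  # ignore the center value
vec = np.where((diff < 0).all(axis=(3, 4, 5)), 13,
               np.argmax(scaled.reshape(4, 4, 4, 27), axis=3))

# Reference per-voxel computation for one arbitrary interior voxel.
i, j, k = 2, 3, 1
vals = np.array([img[i + dx, j + dy, k + dz] for dx, dy, dz in locations])
d = img[i, j, k] - vals
if (d < 0).all():
    expected = 13
else:
    s = d * inv
    s[13] = -100  # flat index 13 is the center (1, 1, 1)
    expected = np.argmax(s)

assert vec[i - 1, j - 1, k - 1] == expected
```

The `-1` shift when indexing `vec` accounts for the excluded edge layer: interior voxel `(i, j, k)` of the image maps to position `(i-1, j-1, k-1)` of the interior-only result.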
On my machine, that is a ×70 gain.

What it does is build a (P-2)×(H-2)×(W-2)×3×3×3 array of all the 3x3x3 cubes around each z, y, x point in 1..P-1, 1..H-1, 1..W-1. From there, it is possible to compute in single vectorized operations what you did in the for loops.
I said "attempt", which may not seem very confident, because this is quite memory consuming. Not sliding_window_view itself: that costs nothing, it is just a view, with no data of its own (the memory is that of image_input). So neighborhood[12,13,14,1,1,1] is the same as neighborhood[13,14,15,0,0,0]. Not just the same value: the same data, the same memory place. If you set neighborhood[12,13,14,1,1,1] = 999 (assuming a writeable view; sliding_window_view is read-only by default), and then print(neighborhood[13,14,15,0,0,0]) or print(neighborhood[13,12,13,0,2,2]), it prints your 999 back.
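That aliasing can be checked directly. A small sketch (note that sliding_window_view returns a read-only view by default, so writing through it requires writeable=True):

```python
import numpy as np

img = np.arange(5 * 5 * 5, dtype=np.float32).reshape(5, 5, 5)
neigh = np.lib.stride_tricks.sliding_window_view(img, (3, 3, 3))

# The view owns no data: it aliases img's buffer.
assert np.shares_memory(img, neigh)

# Writing requires an explicitly writeable view.
neigh_w = np.lib.stride_tricks.sliding_window_view(img, (3, 3, 3), writeable=True)
neigh_w[0, 0, 0, 1, 1, 1] = 999.0  # this element is img[1, 1, 1]

print(img[1, 1, 1])             # 999.0
print(neigh[1, 1, 1, 0, 0, 0])  # 999.0 -- same memory place, other index
```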
So neighborhood costs nothing. And its shape (which amounts to, roughly, 27 times the volume of the initial image) is "virtual".

But, still, even if it does not use actual memory, it is an array of almost 27 times more values than the image (each value being repeated 27 times; the "almost" is because of the edges). So, as soon as you do some computation on it, you get such an array, this time with every value in its own place in memory.
In other words, if you start from a 100x100x100 image, which uses 1000000 memory places (4000000 bytes, since those are np.float32), neighborhood costs nothing: it is the same 4000000 bytes that make up its almost 27000000 values. But centre_minus_rest, and the subsequent computations, do occupy "almost" 108000000 bytes each in memory.
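These figures can be checked with a few lines of arithmetic (a sketch for the 100x100x100 float32 example; the intermediate size is computed over the 98x98x98 interior voxels, which is the "almost" part):

```python
import numpy as np

shape = (100, 100, 100)
itemsize = np.dtype(np.float32).itemsize  # 4 bytes per value

# The only real allocation behind the neighborhood view: the image itself.
image_bytes = shape[0] * shape[1] * shape[2] * itemsize
print(image_bytes)  # 4000000

# centre_minus_rest materializes 27 values per interior voxel.
interior = (shape[0] - 2) * (shape[1] - 2) * (shape[2] - 2)
intermediate_bytes = interior * 27 * itemsize
print(intermediate_bytes)  # 101648736, i.e. "almost" 108000000 bytes
```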
So, depending on the size of your image (and the memory available on your machine), we may have replaced a CPU-time problem with a memory problem. Compromises are possible, though.
You could keep your first loop (for i in range...), and vectorize the computation over all j and k. Anyway, the inner loops are the ones that cost the most and need to be vectorized; not vectorizing the outer loop is not that serious. In my 100x100x100 example, that means 100 non-vectorized (outer) iterations, each performing one vectorized operation over a ~10000-voxel slice, instead of one vectorized operation over the ~1000000-voxel volume: not a huge regression.
for i in range(1, image_shape[0] - 1):
    centre_minus_rest = image_input[i, 1:-1, 1:-1, None, None, None] - neighborhood[i-1]
    centre_minus_rest_divided = centre_minus_rest * euclidian_distances_inverse.reshape(1, 1, 3, 3, 3)
    centre_minus_rest_divided[:, :, 1, 1, 1] = -100
    class_label = np.argmax(centre_minus_rest_divided.reshape(image_shape[1]-2, image_shape[2]-2, 27), axis=2)
    label[i, 1:-1, 1:-1] = np.where((centre_minus_rest < 0).all(axis=(2, 3, 4)), 13, class_label)
That is as fast on my machine: still the same ×70 gain, with no measurable decrease. And yet, the intermediate values (centre_minus_rest, ...) are roughly 100 times smaller in memory, since they cover one slice along i instead of the whole volume.
Note that this is still not ideal code. I should pick one shape once and for all: either 3x3x3 subarrays of values, Euclidean distances, etc., or 1D arrays of size 27. (But sliding_window_view needs a 3x3x3 window, otherwise it can't do its magic, so reshaping into (P-2, H-2, W-2, 27) would necessitate a full copy; and on the contrary, argmax can't work along more than one axis, so it needs length-27 arrays.)
Also, the itertools trick is very pythonic, but not at all numpythonic. Something like np.sqrt(np.arange(-1, 2)[:, None, None]**2 + np.arange(-1, 2)[None, :, None]**2 + np.arange(-1, 2)[None, None, :]**2) would build the distances by broadcasting instead (the offsets need to be centred on 0 to match). But that is anyway outside the costly loop.
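To confirm the two constructions agree, a quick sketch comparing the itertools-based distances with the broadcast-based ones:

```python
import itertools
import numpy as np

# itertools-based distances, as in the question (shape (27,)).
locations = np.array(list(itertools.product(range(-1, 2), repeat=3)))
dist_py = np.linalg.norm(locations, axis=1)

# Broadcast-based equivalent: squared centred offsets summed over the 3 axes.
off = np.arange(-1, 2)
dist_np = np.sqrt(off[:, None, None]**2 + off[None, :, None]**2 + off[None, None, :]**2)

# Same values, in the same (C-order) layout.
assert np.allclose(dist_py.reshape(3, 3, 3), dist_np)
print(dist_np[1, 1, 1])  # 0.0 -- the center offset (0, 0, 0)
```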
It must also be noted that one inconvenience of vectorization is that it sometimes forces you to compute more things than you need (but if you compute twice too many things, 100 times faster, that is still a ×50 gain). Here, my np.where((centre_minus_rest < 0).all(axis=(2, 3, 4)), 13, class_label) means that for each voxel whose answer is 13, I have computed class_label in vain; which your for/for/for/if version did not.
So take this more as a first lead on how to do it than as a definitive solution. But, still, ×70 is a good start.