SIMD in AssemblyScript

Hey I created a Box Blur algorithm in AssemblyScript.

To make it more efficient, I would like to use SIMD Operations.

For example I have which looks like this:

    for(let column: i16 = x + deviationBase + 1; column < x + width - deviationBase; column++){
        r += load<u8>(rowPosition + (column + deviationBase) * 4    )
        g += load<u8>(rowPosition + (column + deviationBase) * 4 + 1) 
        b += load<u8>(rowPosition + (column + deviationBase) * 4 + 2)
        r -= load<u8>(rowPosition + (column - deviationBase) * 4    ) 
        g -= load<u8>(rowPosition + (column - deviationBase) * 4 + 1)
        b -= load<u8>(rowPosition + (column - deviationBase) * 4 + 2)
        
        store<u8>(rowPosition + column * 4    , (r / diameter) as u8)
        store<u8>(rowPosition + column * 4 + 1, (g / diameter) as u8)
        store<u8>(rowPosition + column * 4 + 2, (b / diameter) as u8)
      }

How can I make this for loop faster with SIMD? It would be nice if you could also give me some explanations :)

Solution

I tried to tackle this problem, my attempt is here.

So, the usual way of dealing with simd, AssemblyScript being no different, is the following:

load the data into simd registers from regular registers or memory
compute the vectorized calculations
store the data back into regular registers or memory

AssemblyScript has one simd datatype: v128, which can store 16 u8's. We will use it in order to do simd calculations.

I built a box blur example using this resource. I took some decisions in order to remove edge-cases and focus only on the simd part:

I used the rgba format, as opposed to the rgb format. Keeping things as powers of two helps reduce edge-cases, and better fits for the v128.
I padded the image that I am blurring with pixels with values of 0, in order to reduce the complexity of the algorithm. There is no need for bounds checking now.

Now, for the actual algorithms: I implemented three algorithms. The box_blur_naive, box_blur_improved and box_blur_simd.

In the box_blur_naive there is the familiar loading and adding for every pixel and its neighbours. For example, for the red color:

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j - 1) * 4 + 0);
red += load<u8>(img + (i) * width * 4 + (j + 1) * 4 + 0);

Benchmarking time: 29.430s

In order to simdize this, we need to rearrange things a bit. First, the vertical lines (i, i - 1, i + 1) are very easily convertible to simd instructions. The problem is that there is no easy way to add the horizontal neighbours with v128, because they all end up in the same register.

Fortunately, there is an easy way for box blur to separate the horizontal and vertical additions, with the help of a transposed image, which is what box_blur_improved does:

red += load<u8>(img + i * width * 4 + j * 4 + 0);
red += load<u8>(img + (i - 1) * width * 4 + (j) * 4 + 0);
red += load<u8>(img + (i + 1) * width * 4 + (j) * 4 + 0);
...
red += load<u8>(transposed_img + (i - 1) * height * 4 + (j) * 4 + 0);
red += load<u8>(transposed_img + (i + 1) * height * 4 + (j) * 4 + 0);

Benchmarking time: 30.225s

Now, we only have vertical additions, so we can finally start bringing in v128. Getting the data into v128:

line = v128.load(img + i * width * 4 + 4 + k * 16);
line_before = v128.load(img + (i - 1) * width * 4 + 4 + k * 16);
line_after = v128.load(img + (i + 1) * width * 4 + 4 + k * 16);

This does the same thing as the loads before, but for 16 u8 values at a single time. We can now perform the additions:

let result = v128.add<u8>(line, v128.add<u8>(line_before, line_after));

This line performs all the vertical additions from before. We have not added the transposed additions, because of a future problem that I will explain shortly.

v128.store(blurred_img + i * width * 4 + 4 + k * 16, result);

This stores the result at the specified address. And that is all.

Benchmarking time: 17.637s

The simd solutions seems to save about half the time, and the code is not fully simdized.

The last problem is that there is no easy way to do integer division in an easy way with v128 (and simd in general).

Consider the way in which we store the data in memory in the first algorithms:

(load<u8>(blurred_img + j * width * 4 + i * 4 + 0) + red) / 5) as u8

We have to do division by 5, which is not an operation on v128. There may be some ways of doing the division by using bit shifts and subtractions. Judging from the speed increase we already obtained, it may be worthwhile to do it.