c image-processing openmp simd ispc

Optimize a Separable Convolution for SIMD Friendliness and Efficiency


I implemented a simple 2D separable convolution in C:

void SeparableConvolution2D(const float * restrict mI, float * restrict mO, float * restrict mB,
                            int numRows, int numCols,
                            const float * restrict vKRow, int rowKernelLen,
                            const float * restrict vKCol, int colKernelLen)
{
    // mI (numRows x numCols)   - The input.
    // mO (numRowsO x numColsO) - The output.
    // mB (numColsO x numRows)  - The buffer: the intermediate, transposed version of the row-filtered image.
    // Notes:
    // - Parallelization (threading) overhead is too big.
    // - GCC native SIMD (auto-vectorization) is better than OpenMP SIMD.

    int rowKernelRad = rowKernelLen / 2; // Radius
    int colKernelRad = colKernelLen / 2; // Radius
    
    // Output dimensions for "Valid" convolution
    int numRowsO = numRows - (colKernelLen - 1);
    int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (int r = 0; r < numRows; r++) 
    {
        #pragma omp simd
        for (int c = 0; c < numColsO; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < rowKernelLen; k++) 
            {
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            }
            mB[c * numRows + r] = sum; // Store row result
        }
    }
    
    // Apply 1D filter along columns
    // The data in `mB` is transposed, hence again working along the rows
    for (int r = 0; r < numColsO; r++) 
    {
        #pragma omp simd
        for (int c = 0; c < numRowsO; c++) 
        {
            float sum = 0.0f;
            for (int k = 0; k < colKernelLen; k++) 
            {
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            }
            mO[c * numColsO + r] = sum; // Store final result
        }
    }

}
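
For context, here is a minimal driver showing how I call it (everything below is an illustrative sketch: the sizes, kernel values and buffer handling are not my actual code):

#include <stdlib.h>

int main(void)
{
    const int numRows = 512, numCols = 512;
    // 5-tap box kernels, for illustration only
    const float vKRow[5] = {0.2f, 0.2f, 0.2f, 0.2f, 0.2f};
    const float vKCol[5] = {0.2f, 0.2f, 0.2f, 0.2f, 0.2f};

    float * mI = malloc((size_t)numRows * numCols * sizeof(float));
    float * mO = malloc((size_t)numRows * numCols * sizeof(float)); // "Valid" output is smaller; this is safely oversized
    float * mB = malloc((size_t)numRows * numCols * sizeof(float)); // Transposed intermediate buffer

    // ... fill mI with pixel data ...

    SeparableConvolution2D(mI, mO, mB, numRows, numCols, vKRow, 5, vKCol, 5);

    free(mB); free(mO); free(mI);
    return 0;
}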

I want it to be SIMD-efficient, yet portable.
Hence I don't want to use any SIMD intrinsics, nor add any dependency to the code.
I tried using OpenMP SIMD, though without much success.

The objective is to vectorize the loop over c.
Yet it seems the compiler only vectorizes the calculation of the sum variable.

Is there a way to achieve vectorization on the c loop?
Any way to hint the compiler? Any OpenMP SIMD magic?

The target compiler is GCC.

Optional Transformation

One transformation I thought of is explicit unrolling.
For instance, the first loop would become:

for (int r = 0; r < numRows; r++) 
{
    const float * vI = mI + (r * numCols); // Pointer to the current row
    // #pragma omp simd
    // #pragma vector always // ICC-specific equivalent of the hint below
    #pragma GCC ivdep
    for (int c = 0; c + 4 <= numColsO; c += 4) // A scalar tail loop is still needed when numColsO % 4 != 0
    {
        // Pointers to 4 consecutive pixels of the input image
        const float * vI0 = &vI[c + rowKernelLen - 1];
        const float * vI1 = &vI[c + rowKernelLen];
        const float * vI2 = &vI[c + rowKernelLen + 1];
        const float * vI3 = &vI[c + rowKernelLen + 2];

        // Apply the kernel to each pixel
        float sum0 = 0.0f;
        float sum1 = 0.0f;
        float sum2 = 0.0f;
        float sum3 = 0.0f;

        for (int k = 0; k < rowKernelLen; k++) 
        {
            sum0 += vI0[-k] * vKRow[k];
            sum1 += vI1[-k] * vKRow[k];
            sum2 += vI2[-k] * vKRow[k];
            sum3 += vI3[-k] * vKRow[k];
        }

        mB[c * numRows + r]       = sum0; // Store row results
        mB[(c + 1) * numRows + r] = sum1;
        mB[(c + 2) * numRows + r] = sum2;
        mB[(c + 3) * numRows + r] = sum3;
    }
}

The motivation is to "hint" to the compiler that it should broadcast the kernel value vKRow[k] and multiply it across multiple image pixels. Will that achieve this?

Update

I accepted @JérômeRichard's answer, though the exact code I use is the one in my answer. Still, @JérômeRichard's answer is what introduced me to ISPC, which is the tool that forces the correct SIMD in this case.


Solution

  • The objective is to vectorize the loop over c.

    This is the standard way of doing that in OpenMP:

        // Apply 1D filter along rows
        for (int r = 0; r < numRows; r++) 
        {
            #pragma omp simd
            for (int c = 0; c < numColsO; c++)
            {
                float sum = 0.0f;
                for (int k = 0; k < rowKernelLen; k++) 
                    sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
                mB[c * numRows + r] = sum;
            }
        }
    

    In practice though, most compilers cannot vectorise this loop because the transformation is currently too complicated for them. More specifically, this is because of the inner loop with a runtime-defined number of iterations, as pointed out by Homer512 in the comments. For example, ICC 2021.10 succeeds in vectorising the code correctly (i.e. as requested); ICX 2024.0 also vectorises it, but seemingly not correctly, as the warning it raises indicates; GCC 14.2 and Clang 20.1 fail to (efficiently) vectorise the code in this case. See on Godbolt.

    Yet it seems the compiler only vectorizes the calculation of the sum variable.

    This is because, AFAIK, GCC internally relies on auto-vectorisation and does not really/fully implement most of OpenMP's SIMD directives. Put shortly: it does not really care about your OpenMP hints (compiling with -fopt-info-vec-missed shows which loops GCC failed to vectorise and why).

    One trick that sometimes works is to use omp declare simd and call the SIMD-friendly function, so as to help the compiler know how to vectorise the loop. But this does not appear to work here (see my attempt on Godbolt). A sketch of the idea is shown below.
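
    For reference, here is a minimal sketch of that trick (the function name and clauses are illustrative, not the exact Godbolt attempt). The hope is that the compiler emits a SIMD variant of ApplyKernel in which c maps to SIMD lanes while the other arguments stay scalar:

        // Hypothetical `omp declare simd` sketch (names are illustrative)
        #pragma omp declare simd uniform(pRow, vK, kernelLen) linear(c)
        static float ApplyKernel(const float * pRow, const float * vK, int kernelLen, int c)
        {
            float sum = 0.0f;
            for (int k = 0; k < kernelLen; k++)
                sum += pRow[c - k + kernelLen - 1] * vK[k];
            return sum;
        }

        // Caller: the `omp simd` loop now contains one call per lane
        for (int r = 0; r < numRows; r++)
        {
            #pragma omp simd
            for (int c = 0; c < numColsO; c++)
                mB[c * numRows + r] = ApplyKernel(mI + r * numCols, vKRow, rowKernelLen, c);
        }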

    Even if it did care, I would not expect the GCC optimiser to perform this kind of loop transformation (as opposed to ICC, which is known to do it pretty well, sometimes even automatically).

    Overall, this is the problem with (OpenMP) compiler hints: in many cases, not implementing anything at all is completely compliant with the standard, which does not require the implementation to actually generate vectorised code.

    If you want stronger guarantees on vectorisation while keeping the code portable, then you should pick one of the following options:

    Adapting the code to help the optimiser generate SIMD code, for example by splitting the loop into tiles with a fixed size known at compile time (as suggested by Homer512), increases the probability that the compiler vectorises it; a sketch follows below. However, this is still far from guaranteed and remains brittle: it may fail on another version of the same compiler, and I have often seen that happen. Using OpenMP directives tends to improve the situation, but it is still rather brittle in non-trivial cases. In Clang, this solution can sometimes make the code significantly less efficient due to bad choices made by the SLP vectoriser. Moreover, it tends to make the code significantly more complicated in practice, because of the additional loops and temporary buffers for tiling, and because of the duplicated code handling the last items that do not fit in SIMD registers (similar to what we do with intrinsics on SIMD instruction sets that do not support masking, like SSE/AVX/Neon, as opposed to AVX-512/SVE).
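
    Here is a hypothetical tiling sketch for the first filtering step (the tile width is illustrative; pick it to match the target SIMD width). The inner j loop has a compile-time trip count, which compilers vectorise far more reliably than a loop whose bounds depend on runtime values. Note that the strided stores into mB remain a problem (see the next point):

        enum { TILE = 8 }; // Illustrative tile width
        for (int r = 0; r < numRows; r++)
        {
            const float * vI = mI + r * numCols;
            int c = 0;
            for (; c + TILE <= numColsO; c += TILE)
            {
                float sums[TILE] = { 0.0f };
                for (int k = 0; k < rowKernelLen; k++)
                {
                    // Fixed-size loop: each lane j accumulates its own sum
                    #pragma omp simd
                    for (int j = 0; j < TILE; j++)
                        sums[j] += vI[c + j - k + rowKernelLen - 1] * vKRow[k];
                }
                for (int j = 0; j < TILE; j++)
                    mB[(c + j) * numRows + r] = sums[j];
            }
            for (; c < numColsO; c++) // Duplicated scalar code for the tail
            {
                float sum = 0.0f;
                for (int k = 0; k < rowKernelLen; k++)
                    sum += vI[c - k + rowKernelLen - 1] * vKRow[k];
                mB[c * numRows + r] = sum;
            }
        }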

    There is another thing to consider: mO[c * numColsO + r] = sum; cannot be vectorised efficiently on any mainstream x86/ARM CPU yet when the c-based loop is vectorised. Indeed, the memory stores are strided, not contiguous. x86 has had a scatter instruction for such a memory access pattern since AVX-512, but it is very inefficient on all CPUs so far (often even slower than scalar stores). AFAIK, on ARM, SVE also supports scatters, but most ARM CPUs do not support SVE yet (some ARM server CPUs do, like the latest Graviton CPUs). Even when it is implemented rather efficiently, such scatter stores are expensive because they operate on many cache lines simultaneously. It is certainly more efficient to write the data contiguously into a temporary buffer and only then transpose it, as sketched below.
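
    A sketch of that idea (the extra buffer mT and the block size are assumptions for illustration): step 1 keeps all stores contiguous so the c loop can vectorise cleanly, and step 2 transposes afterwards in cache-friendly blocks.

        enum { BLK = 32 }; // Illustrative block size; tune per target

        // Step 1: row filter with contiguous stores; mT is numRows x numColsO
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numColsO; c++)
            {
                float sum = 0.0f;
                for (int k = 0; k < rowKernelLen; k++)
                    sum += mI[r * numCols + (c - k + rowKernelLen - 1)] * vKRow[k];
                mT[r * numColsO + c] = sum;
            }

        // Step 2: blocked transposition mT (numRows x numColsO) -> mB (numColsO x numRows)
        for (int cb = 0; cb < numColsO; cb += BLK)
            for (int rb = 0; rb < numRows; rb += BLK)
                for (int c = cb; c < cb + BLK && c < numColsO; c++)
                    for (int r = rb; r < rb + BLK && r < numRows; r++)
                        mB[c * numRows + r] = mT[r * numColsO + c];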


    Example using ISPC

    For a CPU code, I would personally pick ISPC, which is often able to do this kind of non-trivial vectorisation pretty well. If you are curious, here is a naive ISPC implementation I quickly wrote (very close to your C code):

    void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                                uniform int numRows, uniform int numCols, 
                                const uniform float vKRow[], uniform int rowKernelLen, 
                                const uniform float vKCol[], uniform int colKernelLen) 
    {
        uniform int rowKernelRad = rowKernelLen / 2;
        uniform int colKernelRad = colKernelLen / 2;
    
        uniform int numRowsO = numRows - (colKernelLen - 1);
        uniform int numColsO = numCols - (rowKernelLen - 1);
        
        // Apply 1D filter along rows
        for (uniform int r = 0; r < numRows; r++) 
        {
            foreach (c = 0...numColsO)
            {
                float sum = 0.0f;
                #pragma nounroll
                for (uniform int k = 0; k < rowKernelLen; k++) 
                    sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
                mB[c * numRows + r] = sum;
            }
        }
        
        // Apply 1D filter along columns
        for (uniform int r = 0; r < numColsO; r++) 
        {
            foreach (c = 0...numRowsO)
            {
                float sum = 0.0f;
                #pragma nounroll
                for (uniform int k = 0; k < colKernelLen; k++) 
                    sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
                mO[c * numColsO + r] = sum;
            }
        }
    }
    

    This implementation seems to generate the expected SIMD assembly code (see on Godbolt):

    ; Compute 8 items per iteration using AVX2
    .LBB0_7:
            movsxd  rbx, ebx
            vbroadcastss    ymm6, dword ptr [r9 + 4*rax]
            vfmadd231ps     ymm5, ymm6, ymmword ptr [rdi + rbx]
            inc     rax
            add     ebx, -4
            cmp     r12, rax
            jne     .LBB0_7
    

    Note that uniform variables can be seen as scalars shared amongst all SIMD lanes, while varying variables are local to each lane. Variables are varying by default (which can sometimes be harmful for performance). I advise you to read the ISPC manual for more information about this. The programming model is rather close to that of OpenCL and CUDA.

    By the way, ISPC also warns us about the inefficient scatter stores (mentioned above). The code should be optimised so as to avoid these horrible scatter stores. That being said, doing an efficient portable 2D transposition in ISPC can be quite challenging (like with most alternatives, actually). You can still try a naive transposition like in C (the same code, but with uniform added before all variables) and see if it is fast enough for you.

    Note that the above assembly code is clearly sub-optimal because of the latency of the vfmadd231ps instruction (typically 3-4 cycles) and the dependency on the previous iteration. That being said, ISPC does exactly what we asked for: a loop adding numbers in a SIMD-friendly way. Fortunately, ISPC supports a feature called double-pumping (selected through the target, e.g. avx2-i32x16), designed to mitigate such latency issues and better use the SIMD units (see on Godbolt):

    ; Compute 16 items per iteration using AVX2
    .LBB0_26:
            movsxd  r13, r13d
            vbroadcastss    ymm8, dword ptr [r9 + 4*r8]
            vfmadd231ps     ymm6, ymm8, ymmword ptr [rdi + r13 + 32]
            vfmadd231ps     ymm7, ymm8, ymmword ptr [rdi + r13]
            inc     r8
            add     r13d, -4
            cmp     r12, r8
            jne     .LBB0_26
    

    This is still sub-optimal, but twice as good. On CPUs supporting AVX-512, quad-pumping is available, resulting in pretty efficient SIMD assembly code without even changing the source! It should be significantly faster than a scalar implementation, or even than the code GCC generates (except for huge kernel filters). The code is also easy to read and maintain.


    Update:

    After discussing with @chtz in the comments, we can operate on a transposed mB (assuming this variable is only meant to be used as a temporary buffer) and change the vectorisation axis in the second filtering step, while also swapping the loops for the sake of performance. The code should look like this:

    void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                                uniform int numRows, uniform int numCols, 
                                const uniform float vKRow[], uniform int rowKernelLen, 
                                const uniform float vKCol[], uniform int colKernelLen) 
    {
        uniform int numRowsO = numRows - (colKernelLen - 1);
        uniform int numColsO = numCols - (rowKernelLen - 1);
        
        // Apply 1D filter along rows
        for (uniform int r = 0; r < numRows; r++) 
        {
            foreach (c = 0...numColsO)
            {
                float sum = 0.0f;
                #pragma nounroll
                for (uniform int k = 0; k < rowKernelLen; k++) 
                    sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
                mB[r * numCols + c] = sum;
            }
        }
        
        // Apply 1D filter along columns
        for (uniform int c = 0; c < numRowsO; c++) 
        {
            foreach (r = 0...numColsO)
            {
                float sum = 0.0f;
                #pragma nounroll
                for (uniform int k = 0; k < colKernelLen; k++) 
                    sum += mB[((c - k + colKernelLen - 1) * numCols) + r] * vKCol[k];
                mO[c * numColsO + r] = sum;
            }
        }
    }
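
    For completeness, here is how the function could be invoked from C, assuming it is marked export on the ISPC side (everything below is a sketch: file names, the target and the wrapper are assumptions):

        /* Hypothetical C-side usage. Assumes the ISPC function above is declared
           with `export` and compiled with something like:
               ispc -O2 --target=avx2-i32x8 conv.ispc -o conv_ispc.o -h conv_ispc.h
           The generated header declares a plain C prototype for every exported
           function. */
        #include <stdlib.h>
        #include "conv_ispc.h" // Generated by ispc (name is illustrative)

        void RunFilter(const float * mI, float * mO, int numRows, int numCols,
                       const float * vKRow, int rowKernelLen,
                       const float * vKCol, int colKernelLen)
        {
            // Oversized temporary buffer for the row-filtered image
            float * mB = malloc((size_t)numRows * numCols * sizeof(float));
            SeparableConvolution2D(mI, mO, mB, numRows, numCols,
                                   vKRow, rowKernelLen, vKCol, colKernelLen);
            free(mB);
        }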