c++, optimization, avx, avx2, dot-product

Fast int32_t dot product of two C++ integer vectors using AVX is not faster


I've implemented a simple dot product of two vectors:

#include <vector>

int dotProductNoAVX(const std::vector<int>& vec1, const std::vector<int>& vec2)
{
    // Calculate dot product using simple for loops
    int dotProduct = 0;
    for (std::size_t i = 0; i < vec1.size(); ++i)
    {
        dotProduct += vec1[i] * vec2[i];
    }

    return dotProduct;
}

And I implemented a "faster" version using 256-bit AVX2:

#include <immintrin.h>

// Function to calculate the dot product of two vectors using AVX2 intrinsics
int dotProductAVX(const std::vector<int>& vec1, const std::vector<int>& vec2)
{
    // Round the size down to a multiple of 8 (one AVX2 register holds 8 int32_t)
    std::size_t size = vec1.size();
    std::size_t avxSize = size / 8 * 8;

    // Initialize variables for AVX operations
    __m256i sum = _mm256_setzero_si256();

    // Perform AVX dot product
    for (std::size_t i = 0; i < avxSize; i += 8)
    {
        __m256i vec1_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec1[i]));
        __m256i vec2_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec2[i]));
        sum = _mm256_add_epi32(sum, _mm256_mullo_epi32(vec1_part, vec2_part));
    }

    // Horizontally sum the 8 lanes of the accumulator
    int result[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(result), sum);
    int dotProduct = result[0] + result[1] + result[2] + result[3] + result[4] + result[5] + result[6] + result[7];

    // Sum the remaining elements using scalar operations
    for (std::size_t i = avxSize; i < size; ++i)
    {
        dotProduct += vec1[i] * vec2[i];
    }

    return dotProduct;
}

I get roughly the same speed; sometimes the AVX version is 10-20% faster.

I'm using VS2017 and an i7-12700K (which supports AVX as far as I understand).

I've even tried profiling it, but nothing looks too suspicious. The only thing is that this line:

    int dotProduct = result[0] + result[1] + result[2] + result[3] + result[4] + result[5] + result[6] + result[7];

takes a long time, but I'm not sure how to improve it.

I'm testing with this main function:

#include <chrono>
#include <iostream>
#include <random>

std::vector<int> generateRandomVector(std::size_t size)
{
    std::vector<int> randomVector;
    randomVector.reserve(size);

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<int> distribution(1, 100);

    for (std::size_t i = 0; i < size; ++i)
    {
        randomVector.push_back(distribution(gen));
    }

    return randomVector;
}

int main()
{
    // Specify the size of the random vectors
    std::size_t vectorSize = 100;
    int iterations = 1000000;
    // Generate two random vectors
    std::vector<int> vector1 = generateRandomVector(vectorSize);
    std::vector<int> vector2 = generateRandomVector(vectorSize);

    // Measure time for dot product with AVX
    int resultAVX;
    auto startAVX = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++)
        resultAVX = dotProductAVX(vector1, vector2);
    //resultAVX = dotProductAVX512(vector1, vector2);
    auto endAVX = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> durationAVX = endAVX - startAVX;

    int resultNoAVX;
    auto startNoAVX = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++)
        resultNoAVX = dotProductNoAVX(vector1, vector2);
    auto endNoAVX = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> durationNoAVX = endNoAVX - startNoAVX;

    // Display the result and time taken
    std::cout << "Dot product without AVX: " << resultNoAVX << std::endl;
    std::cout << "Time taken without AVX: " << durationNoAVX.count() << " seconds" << std::endl;



    // Display the result and time taken
    std::cout << "Dot product with AVX: " << resultAVX << std::endl;
    std::cout << "Time taken with AVX: " << durationAVX.count() << " seconds" << std::endl;


    return 0;
}

Solution

  • TL;DR: your cleanup is even worse than what MSVC does when auto-vectorizing, and that's significant for short vectors. Optimizing your cleanup code should make a dot product that's a bit faster than MSVC's auto-vectorized code, at the cost of specializing it for a specific SIMD width.

    For larger sizes, unrolling the main loop by 2 would help especially on Intel CPUs where vpmulld is 2 uops, or 3 with a memory source operand.
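
    For example, a 2-vector unroll with two accumulators might look like this (an untested sketch; the variable names are mine, and it assumes avxSize has been rounded down to a multiple of 16, with the leftovers handled by the cleanup discussed below). Two accumulators keep the vpaddd dependency chains separate and halve the loop overhead per element.

        __m256i sum0 = _mm256_setzero_si256();
        __m256i sum1 = _mm256_setzero_si256();
        for (std::size_t i = 0; i < avxSize; i += 16)   // 2 vectors = 16 int32_t per iteration
        {
            __m256i a0 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec1[i]));
            __m256i b0 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec2[i]));
            __m256i a1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec1[i + 8]));
            __m256i b1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec2[i + 8]));
            sum0 = _mm256_add_epi32(sum0, _mm256_mullo_epi32(a0, b0));
            sum1 = _mm256_add_epi32(sum1, _mm256_mullo_epi32(a1, b1));
        }
        __m256i sum = _mm256_add_epi32(sum0, sum1);  // combine accumulators before the horizontal sum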


    Your benchmark is probably valid; even though the compiler could hoist the work out of the repeat loop (footnote 1), MSVC 19.37 chooses not to inline the dot-product functions, so that doesn't happen.

    MSVC auto-vectorizes the scalar version (with SSE4.1 pmulld with -O2 without any arch options, or with -O2 -arch:AVX2 using vpmulld ymm like your intrinsics.) MSVC's auto-vectorized version unrolls the loop by 2 vectors, which is probably good. In your manual loop, front-end throughput is the bottleneck even on 6-wide Alder Lake P-cores (Golden Cove). Hopefully you're pinning your benchmark to a P core or an E core, depending on which one you want to measure.

    Your vectors are short (only 100 elements) so scalar cleanup of leftover elements, and the horizontal sum, are important factors. 100 elements is only 12.5 vectors of 8 elements each.

    The result[0] + result[1] + ... compiles even worse than you'd expect (Godbolt): MSVC "optimizes" the vector store to an array + scalar element access into shuffles like vextractf128 xmm0, ymm2, 1 / vpextrd eax, xmm2, 1 / add ebx, eax when you compile with -arch:AVX2. Yes, repeating the same vextractf128 for all four elements of the high half. Or worse without that, using vextractf128 xmm0, ymm2, 1 / vpsrldq xmm0, xmm0, 4 / vmovd eax, xmm0 (although that's still the same amount of shuffle uops; vpextrd r, x, imm is a 2-uop instruction: https://uops.info/). Store and scalar reload would be faster for accessing each element one at a time. But either way is slower than an efficient hsum with manual shuffling.

    See hsum_8x32 in Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2 (and Fastest way to do horizontal SSE vector sum (or other reduction) in general) for how to horizontally sum a vector of int32_t elements. As always, shuffle the high half down to line up with the low half for _mm_add_epi32, narrowing until you're down to 1 element (footnote 2).
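
    The cleanup code further down calls an hsum_epi32_avx(__m128i) helper; here's a sketch of that kind of helper, following the shuffle-the-halves pattern described above (see the linked answers for the canonical versions):

        #include <immintrin.h>

        // Reduce 4x int32_t in a __m128i to one int, narrowing by halves each step
        static inline int hsum_epi32_avx(__m128i x)
        {
            __m128i hi64  = _mm_unpackhi_epi64(x, x);     // bring the high 64 bits down
            __m128i sum64 = _mm_add_epi32(hi64, x);
            __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));  // swap the low two elements
            __m128i sum32 = _mm_add_epi32(sum64, hi32);
            return _mm_cvtsi128_si32(sum32);              // vmovd the low element to a scalar register
        }

        // Reduce 8x int32_t in a __m256i: add the two 128-bit halves, then reuse the 128-bit hsum
        static inline int hsum_8x32(__m256i v)
        {
            __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(v),
                                           _mm256_extracti128_si256(v, 1));
            return hsum_epi32_avx(sum128);
        }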

    MSVC's auto-vectorized code for dotProductNoAVX uses a vectorized horizontal sum that's not as efficient as it could be, especially on E-cores (it uses vphaddd ymm, which is 2 shuffle + 1 add uop on P-cores, or twice that on E-cores since it foolishly doesn't reduce to 128-bit vectors as the first step).

    MSVC also auto-vectorizes your scalar cleanup loop, not realizing that it only runs 0..7 iterations. This doesn't cost much, just an extra branch to skip the vectorized version, but one way to work around that is to make it more obvious to the compiler that the loop range is small, just size%8.

        // Sum the remaining elements using scalar operations
        // MSVC realizes this shouldn't vectorize, unlike the previous version
        for (std::size_t i = 0; i < size%8; ++i) {
            dotProduct += vec1[avxSize+i] * vec2[avxSize+i];
        }
    

    Also, doing a 16-byte vector as part of the cleanup can be worth it, saving 4 scalar iterations. For your element count of 100, 100 % 8 is 4, so there's exactly 1 __m128i vector left after doing 12x __m256i vectors. 100 % 16 == 4 as well, so unrolling your main loop by 2 vectors wouldn't create extra cleanup work if your actual use-case is vectors of exactly 100 elements.

    You can fold this final 128-bit vector into the horizontal-sum part of the cleanup by adding it in after you've reduced the main loop's result down to one __m128i.

    // dot product of two vectors using AVX2 intrinsics
    // optimized for small sizes
    // UNTESTED
    int dotProductAVX(const std::vector<int>& vec1, const std::vector<int>& vec2)
    {
        // Round the size down to a multiple of 8 (one AVX2 register holds 8 int32_t)
        std::size_t size = vec1.size();
        std::size_t avxSize = size / 8 * 8;
    
        // Perform AVX dot product
        __m256i sum = _mm256_setzero_si256();
        for (std::size_t i = 0; i < avxSize; i += 8)
        {
            __m256i vec1_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec1[i]));
            __m256i vec2_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec2[i]));
            sum = _mm256_add_epi32(sum, _mm256_mullo_epi32(vec1_part, vec2_part));
            // optionally unroll; also change i+=16 and avxSize calc
            // and improve the cleanup with either a __m128i or __m256i loop
            // vec1_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec1[i+8]));
            // vec2_part = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec2[i+8]));
            //sum = _mm256_add_epi32(sum, _mm256_mullo_epi32(vec1_part, vec2_part));
            // clang will unroll this loop for us
        }
    
        __m128i sum128 = _mm_add_epi32( 
                     _mm256_castsi256_si128(sum),
                     _mm256_extracti128_si256(sum, 1));
    
        std::size_t cleanupSize = size%8;
        if (cleanupSize >= 4) {
                __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&vec1[avxSize]));
                __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&vec2[avxSize]));
                sum128 = _mm_add_epi32(sum128, _mm_mullo_epi32(v2, v1));
                //cleanupSize -= 4; // stops MSVC and clang from realizing cleanupSize is less than 1 vector
                avxSize += 4;
        }
        int dotProduct = hsum_epi32_avx(sum128);
        cleanupSize = size % 4;  // keep it super simple so compilers still don't auto-vectorize the final cleanup
        // Sum the remaining elements using scalar operations
        for (std::size_t i = 0; i < cleanupSize; ++i)
        {
            // clang goes nuts with this, spending tons of instructions on maskload
            // and another hsum.  MSVC just unrolls it with some branching
            dotProduct += vec1[avxSize+i] * vec2[avxSize+i];
        }
    
        return dotProduct;
    }
    

    See how it compiles on Godbolt, much more efficiently:

    ... subtract one pointer from another so only one needs an indexed addr mode
    $LL4@dotProduct:                            ;; The main loop, still not unrolled
            vmovdqu ymm1, YMMWORD PTR [rax]                 ; 1 uop
            vpmulld ymm1, ymm1, YMMWORD PTR [rdx+rax]       ; 3 uops
            lea     rax, QWORD PTR [rax+32]                 ; 1 uop
            vpaddd  ymm2, ymm1, ymm2                        ; 1 uop
            sub     rcx, 1                                  ; macro-fused with JNE
            jne     SHORT $LL4@dotProduct                   ; 1 uop
    
    $LN3@dotProduct:
            movzx   eax, r9b
            and     al, 7                         ; size % 8
            vextracti128 xmm0, ymm2, 1
            vpaddd  xmm3, xmm0, xmm2              ; sum128 = first step of hsum
    
            cmp     al, 4
            jb      SHORT $LN44@dotProduct        ; if (cleanupSize < 4) goto skip last __m128
            vmovdqu xmm1, XMMWORD PTR [rbx+rdi*4]
            mov     rax, QWORD PTR [rsi]          ; Maybe reloading a std::vector begin or end pointer?  I didn't look at the whole function
            vpmulld xmm1, xmm1, XMMWORD PTR [rax+rdi*4]
            vpaddd  xmm3, xmm1, xmm3              ; sum128 += 
            add     rdi, 4                        ; avxSize += 4
    
    $LN44@dotProduct:
            xor     r14d, r14d
            and     r9d, 3                        ; cleanupSize = size % 4
            mov     r10d, r14d
            mov     r11d, r14d
            vpunpckhqdq xmm0, xmm3, xmm3          ; hsum_epi32_avx
            vpaddd  xmm2, xmm0, xmm3
            vpshufd xmm1, xmm2, 177
            vpaddd  xmm0, xmm1, xmm2
            vmovd   r8d, xmm0                     ; scalar int dotProduct
    
        ... then 0 to 3 scalar imul
    

    So the total shuffling to get int dotProduct = sum(vector) was only 3 shuffles and 3 vector adds, vs. 4x2 (high 128) + 4 (low 128) shuffles plus 8x vmovd + 7x scalar add from result[0] + result[1] + ... because of how badly MSVC did with that.

    BTW, clang also auto-vectorizes dotProductNoAVX, unrolling by 4 vectors so it's doing 32 elements per loop iteration. But its only cleanup is a scalar loop after hsumming the vector result, so it could be doing as many as 31 iterations of scalar imul. That would be fine for 100 elements since 100 % 32 == 4, but would be much worse for size = 95.


    Footnote 1: benchmarking methodology
    In general, to make a robust benchmark, you might try ++vector1[80 + (i&16)] to make the data different every time. (Modifying a later element means it will hopefully not still be in the store buffer when the dot product is doing a vector load, otherwise you'll get a store-forwarding stall.) On GNU C compilers, asm("" :: "r"(vector1.data()) : "memory") would be a good way to tell the compiler that the vector data might have changed.

    Either way, assign the result to a volatile int resultAVX; or use something like benchmark::DoNotOptimize() to force the compiler to materialize it in a register inside the loop, instead of only doing the calculation for the final repeat-loop iteration whose result is actually used.
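
    Putting those two suggestions together, the AVX timing loop might look something like this (a sketch only; the asm statement is a GNU C extension, so it's guarded out for MSVC, and the touched element index is just the example above):

        volatile int resultAVX;                        // volatile sink: the result must be stored every iteration
        auto startAVX = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < iterations; i++)
        {
            ++vector1[80 + (i & 16)];                  // change a later element so each repeat sees different data
        #if defined(__GNUC__)
            asm("" :: "r"(vector1.data()) : "memory"); // tell GCC/clang the vector contents may have changed
        #endif
            resultAVX = dotProductAVX(vector1, vector2);
        }
        auto endAVX = std::chrono::high_resolution_clock::now();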

    See also Idiomatic way of performance evaluation? re: those considerations and other things like warm-up effects.


    Footnote 2: Or reduce down to 64-bit before moving to scalar for the last add. The final step can be vmovq rax, xmm0 / rorx rcx, rax, 32 / add eax, ecx. But MSVC apparently doesn't know how to emit rorx for rotates even though it will use other BMI2 instructions like shlx for x << y, so it's probably not worth trying to get MSVC to emit optimal asm that puts less pressure on the three vector execution ports. That would be useful if you're doing a lot of separate dot products.
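
    Written with intrinsics, that last step could look like the sketch below (assuming <immintrin.h> and <cstdint> are included; the helper name is mine, and whether the compiler actually uses rorx for the shift is up to it; as noted, MSVC won't):

        // Final step of an hsum: after the first narrowing add, the two remaining int32
        // partial sums sit in the low 64 bits.  Move them to a scalar register and add there.
        static inline int hsum_final_scalar(__m128i sum64)   // hypothetical helper name
        {
            uint64_t both = static_cast<uint64_t>(_mm_cvtsi128_si64(sum64));  // vmovq
            return static_cast<int>(both) + static_cast<int>(both >> 32);     // add the two 32-bit halves
        }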

    Actually, with lots of separate dot products, you could reduce 8x __m256i down to 8 results by shuffling them together, instead of just reducing each one separately. Like vphaddd with two separate inputs to reduce 2 vectors down to 1. (That's 2 shuffle + 1 add uop per vphaddd, so it's only worth using for cases like this). You'd start with 2x vperm2i128 (_mm256_permute2x128_si256) to feed one _mm256_add_epi32.
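
    A sketch of that idea for the full 8-accumulator case (untested; the function name is mine, and the shuffles can be ordered differently, e.g. starting with the vperm2i128 step as described above; this ordering keeps the results in s0..s7 order):

        // Reduce 8 accumulators s0..s7 to one __m256i whose element i is the horizontal sum of s_i.
        static inline __m256i hsum_8x_8x32(__m256i s0, __m256i s1, __m256i s2, __m256i s3,
                                           __m256i s4, __m256i s5, __m256i s6, __m256i s7)
        {
            __m256i t0 = _mm256_hadd_epi32(s0, s1);   // pairwise sums within each 128-bit lane
            __m256i t1 = _mm256_hadd_epi32(s2, s3);
            __m256i t2 = _mm256_hadd_epi32(s4, s5);
            __m256i t3 = _mm256_hadd_epi32(s6, s7);
            __m256i u0 = _mm256_hadd_epi32(t0, t1);   // lane0 = low-half sums of s0..s3, lane1 = their high-half sums
            __m256i u1 = _mm256_hadd_epi32(t2, t3);   // same for s4..s7
            __m256i lo = _mm256_permute2x128_si256(u0, u1, 0x20);  // low-half sums of s0..s7
            __m256i hi = _mm256_permute2x128_si256(u0, u1, 0x31);  // high-half sums of s0..s7
            return _mm256_add_epi32(lo, hi);          // element i = full horizontal sum of s_i
        }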