c++ floating-point sse numeric fast-math

Why is std::inner_product slower than the naive implementation?


This is my naive implementation of dot product:

float simple_dot(int N, float *A, float *B) {
    float dot = 0;
    for(int i = 0; i < N; ++i) {
        dot += A[i] * B[i];
    }

    return dot;
}

And this is using the C++ library:

float library_dot(int N, float *A, float *B) {
    return std::inner_product(A, A+N, B, 0);
}

I ran some benchmarks (code is here: https://github.com/ijklr/sse), and the library version is a lot slower. My compiler flags are -Ofast -march=native


Solution

  • Your two functions don't do the same thing. std::inner_product uses an accumulator whose type is deduced from the initial value, which in your case (0) is int. Accumulating floats into an int is not just slower than accumulating into a float; it also produces a different result, because each partial sum is truncated to an integer.

    The equivalent of your raw loop is to pass the initial value 0.0f, or equivalently float{}.

    (Note that std::accumulate is very similar in this regard.)