c++ gcc sse x87

SSE gives no speedup for C++ number crunching


I have a heavy number-crunching program that does image processing, mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (with SSE up to SSE4.2 and AVX2).

When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.

When I use doubles instead of floats, there is no slowdown.

My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?


Solution

  • My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?

    Yes, you are.

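    A compiler is only as good as your code; remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "turn the switch on and enjoy a 100% performance boost". On their own, -mfpmath=sse and -msse2 only make GCC do floating-point math with scalar SSE instructions instead of x87; the gain from processing four floats (or two doubles) per instruction appears only when loops are actually vectorized.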

    First of all, compile your code with -ftree-vectorizer-verbose=N to see what the compiler actually vectorized. Note that GCC 4.8 runs the vectorizer only at -O3 or when -ftree-vectorize is given.

    N is the verbosity level; set it to 5 to see all available output (more info can be found here).

    Also, you may want to read about GCC's vectorizer.

    Also keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
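
To make the two routes above concrete, here is a minimal sketch, not the asker's actual code. It applies a single kernel tap of a convolution (out[i] += k * in[i]) twice: once as a plain loop written so the auto-vectorizer has a chance, and once with explicit SSE intrinsics. The function names scale_add_scalar and scale_add_sse are hypothetical, and the sketch assumes the arrays do not overlap and that a GCC-compatible compiler is used (__restrict is a compiler extension). The intrinsics path is guarded by __SSE__ so the file still builds where SSE is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...
#endif

// Plain loop: out[i] += k * in[i].
// __restrict (a compiler extension) promises that out and in do not overlap,
// which removes one common obstacle to auto-vectorization.
void scale_add_scalar(float* __restrict out, const float* __restrict in,
                      float k, std::size_t n) {
    assert(out != nullptr && in != nullptr);
    for (std::size_t i = 0; i < n; ++i)
        out[i] += k * in[i];
}

// Same operation with explicit SSE intrinsics, four floats per iteration.
void scale_add_sse(float* out, const float* in, float k, std::size_t n) {
    assert(out != nullptr && in != nullptr);
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 vk = _mm_set1_ps(k);             // broadcast k to all 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 vin  = _mm_loadu_ps(in + i);       // unaligned load of 4 floats
        __m128 vout = _mm_loadu_ps(out + i);
        vout = _mm_add_ps(vout, _mm_mul_ps(vk, vin));
        _mm_storeu_ps(out + i, vout);             // unaligned store of 4 floats
    }
#endif
    // Scalar tail for the remaining elements (or the whole array without SSE).
    for (; i < n; ++i)
        out[i] += k * in[i];
}
```

Compiling the first function with something like -O3 -march=native -ftree-vectorizer-verbose=5 should show in the report whether GCC vectorized the loop; if it did, the hand-written version may not be any faster, which is worth measuring before committing to intrinsics.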