[SOLVED] SSE gives no speedup for C++ number crunching

I have a heavy number-crunching program that does image processing, mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (which supports SSE up to SSE4.2, plus AVX2).

When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.

When I use doubles instead of floats, there is no slowdown.

My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?

Solution

My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?

Yes, you are.

The compiler is only as good as your code, remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "flip the switch and enjoy a 100% performance boost".
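To make "designed with vectorization in mind" a bit more concrete, here is a minimal sketch of a 1D convolution arranged so that the inner loop is easy for an auto-vectorizer. The function name convolve_valid and its signature are made up for illustration and are not taken from the question's code. It assumes the buffers do not overlap (expressed with the non-standard __restrict extension that GCC accepts), and it does not flip the kernel, so strictly speaking it computes a correlation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original post): 1D "valid" convolution
// without kernel flipping, i.e. out[i] = sum over k of in[i + k] * kernel[k].
// `out` must have room for n - klen + 1 floats.
//
// Design choices that tend to help the auto-vectorizer:
//  * __restrict (a compiler extension) promises the buffers do not alias;
//  * the innermost loop walks `in` and `out` contiguously with unit stride;
//  * the kernel tap `w` is loop-invariant in the innermost loop;
//  * each out[i] is independent of the others within the inner loop.
void convolve_valid(const float* __restrict in, std::size_t n,
                    const float* __restrict kernel, std::size_t klen,
                    float* __restrict out)
{
    assert(klen >= 1 && n >= klen);
    const std::size_t out_len = n - klen + 1;

    for (std::size_t i = 0; i < out_len; ++i)
        out[i] = 0.0f;

    for (std::size_t k = 0; k < klen; ++k) {
        const float w = kernel[k];
        // Independent, contiguous multiply-adds: a natural SIMD candidate.
        for (std::size_t i = 0; i < out_len; ++i)
            out[i] += w * in[i + k];
    }
}
```

Whether GCC actually vectorizes this depends on the optimisation level and flags, so it is still worth checking the vectorizer report rather than assuming.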

First of all, compile your code with -ftree-vectorizer-verbose=N to see what the compiler actually vectorized.

N is the verbosity level; set it to 5 to see all available output (more info can be found here).
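As a rough illustration of what to look for in that report, the sketch below contains one loop that is a good vectorization candidate and one that is not. The function names and the file name loops.cpp are hypothetical. Note that on GCC 4.8 the vectorizer only runs at -O3 or with -ftree-vectorize, and that newer GCC releases report the same information through -fopt-info-vec and -fopt-info-vec-missed instead.

```cpp
#include <cassert>
#include <cstddef>

// Possible way to inspect the report (file name is an assumption):
//   g++ -O3 -march=native -ftree-vectorizer-verbose=5 -c loops.cpp
// On newer GCC versions:
//   g++ -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c loops.cpp

// y[i] = a * x[i] + y[i]
// Every iteration is independent and memory access is unit-stride,
// so this loop is a typical candidate for vectorization.
void saxpy(float a, const float* __restrict x, float* __restrict y,
           std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// In-place running sum.
// Iteration i needs the result of iteration i - 1 (a loop-carried
// dependency), so a straightforward vectorization is not possible
// and the report will usually list this loop as not vectorized.
void prefix_sum(float* data, std::size_t n)
{
    assert(data != nullptr || n == 0);
    for (std::size_t i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```

If the hot loops of the convolution code look more like the second function than the first, the SSE flags alone will not change much.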

Also, you may want to read about GCC's vectorizer.

Also keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics directly (brilliantly documented here) may be the best option.
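For a feel of what using intrinsics directly looks like, here is a small sketch that computes out[i] = a[i] * b[i] + c[i] four floats at a time with plain SSE. The function name mul_add_sse is invented for this example. It uses unaligned loads and stores so it makes no assumption about buffer alignment, and it falls back to scalar code for the last few elements. It assumes an x86 target with SSE available.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...

// Hypothetical helper (not from the original post):
// out[i] = a[i] * b[i] + c[i] for i in [0, n).
void mul_add_sse(const float* a, const float* b, const float* c,
                 float* out, std::size_t n)
{
    assert(n == 0 || (a && b && c && out));

    std::size_t i = 0;

    // Main loop: one 128-bit register holds 4 single-precision floats.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        const __m128 vc = _mm_loadu_ps(c + i);
        const __m128 r  = _mm_add_ps(_mm_mul_ps(va, vb), vc);
        _mm_storeu_ps(out + i, r);              // unaligned store of 4 floats
    }

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

The same register holds only 2 doubles, which is where the expected 2x difference between float and double comes from, but only once the code really is running as packed SIMD instructions.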