I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7 4900MQ (with SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
A compiler is only as good as your code, remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "flip the switch and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what was actually vectorized by the compiler. N is the verbosity level; set it to 5 to see all available output (more info can be found here). Note that the vectorizer itself only runs at -O3 or when -ftree-vectorize is given explicitly.
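As a rough illustration of what "designed with vectorization in mind" can mean for a convolution, here is a minimal sketch. The function name convolve1d and the 1-D "valid" layout are my own assumptions, not the asker's code. The point is only that the inner loop walks contiguous memory with a fixed weight, which is the shape of loop the vectorizer tends to handle; whether GCC actually vectorizes it should still be checked with the verbose flag above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1-D "valid" convolution (correlation form, kernel not flipped).
// Loop order is chosen so the inner loop runs over contiguous output elements
// with a single scalar weight, instead of reducing over the kernel per pixel.
std::vector<float> convolve1d(const std::vector<float>& in,
                              const std::vector<float>& kernel)
{
    if (kernel.empty() || in.size() < kernel.size())
        return std::vector<float>();

    const std::size_t n = in.size() - kernel.size() + 1;
    std::vector<float> out(n, 0.0f);

    const float* src = in.data();
    float* dst = out.data();

    for (std::size_t k = 0; k < kernel.size(); ++k) {
        const float w = kernel[k];
        // Contiguous loads and stores, no branches, no function calls.
        for (std::size_t i = 0; i < n; ++i)
            dst[i] += w * src[i + k];
    }
    return out;
}
```

Compiling this with something like g++ -std=c++11 -O3 -march=native -ftree-vectorizer-verbose=5 should report whether the inner loop was vectorized.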
Also, you may want to read about GCC's vectorizer.
Also keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics (brilliantly documented here) directly may be the best option.
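For a feel of what the intrinsics route looks like, here is a minimal sketch. The helper name scale_add is hypothetical and it is not a full convolution. It processes four floats per 128-bit SSE register (a register holds only two doubles, which is where the expected float-over-double gain comes from) and falls back to plain scalar code when SSE is not enabled at compile time.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, ...
#endif

// Hypothetical helper: out[i] = a[i] * w + b[i] for i in [0, n).
void scale_add(const float* a, const float* b, float w,
               float* out, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 vw = _mm_set1_ps(w);          // broadcast w to all 4 lanes
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i); // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vw), vb));
    }
#endif
    // Scalar tail (or the whole loop when SSE is unavailable).
    for (; i < n; ++i)
        out[i] = a[i] * w + b[i];
}
```

The unaligned load/store variants are used here so the sketch works on any pointer; with 16-byte aligned buffers the aligned variants (_mm_load_ps, _mm_store_ps) could be used instead.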