g++: optimization -march=haswell and newer changes numerical result

I have been optimizing performance and running regression tests along the way, and I noticed that g++ seems to alter results depending on the chosen optimization flags. Until now I assumed that -O2 -march=[whatever] would yield exactly the same results for numerical computations regardless of the architecture chosen. That does not seem to be the case for g++. Targeting older architectures up to ivybridge yields the same results that clang gives for any architecture, but with gcc I get different results for haswell and newer. Is this a bug in gcc, or have I misunderstood something about optimizations? I am really startled, because clang does not seem to show this behavior.

Note that I am well aware that the differences are within machine precision, but they still disturb my simple regression checks.

Here is some example code:

#include <iostream>
#include <armadillo>

int main(){
    arma::arma_rng::set_seed(3);
    arma::sp_cx_mat A = arma::sprandn<arma::sp_cx_mat>(20,20, 0.1);
    arma::sp_cx_mat B = A + A.t();
    arma::cx_vec eig;
    arma::eigs_gen(eig, B, 1, "lm", 0.001);
    std::cout << "eigenvalue: " << eig << std::endl;
}

Compiled using:

g++ -march=[architecture] -std=c++14 -O2 -o test example.cpp -larmadillo

gcc version: 6.2.1

clang version: 3.8.0

Compiled for 64 bit, executed on an Intel Skylake processor.

Solution

This happens because GCC uses fused multiply-add (FMA) instructions by default when they are available. Clang, by contrast, does not use them by default, even when they are available.

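The result of a*b+c can differ depending on whether FMA is used: without FMA the product a*b is rounded before c is added, whereas FMA computes a*b+c exactly and rounds only once. That is why you get different results with -march=haswell (Haswell is the first Intel CPU that supports FMA).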

You can decide whether you want to use this feature with -ffp-contract=XXX.

-ffp-contract=off: the compiler will not contract a*b+c into FMA instructions.
-ffp-contract=on: you get FMA instructions, but only where contraction is allowed by the language standard. In the current version of GCC this behaves like off (because it is not implemented yet).
-ffp-contract=fast (the GCC default): you get FMA instructions.