Tags: eigen, avx, eigen3, ipopt, fma

Vectorization flags with Eigen and IPOPT


I have some C++ functions that I am optimizing with IPOPT. Although the cost function, constraint functions, etc. are written in C++, the code was originally written to use the C interface. I haven't bothered to change that, and won't unless it turns out to be the issue.

Anyway... We are observing some unexpected behavior where the optimizer converges differently when we compile the program with/without vectorization flags. Specifically, in the CMakeLists file, we have

set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")

When we build and run with these flags, the optimizer converges in approximately 100 iterations. So far, so good.

However, we have reason to believe that when compiled for ARM (Android specifically), there is no vectorization occurring, because the performance is drastically different from what we see on an Intel processor. The Eigen documentation says that NEON instructions should always be enabled for 64-bit ARM, but we suspect that is not happening. Anyway, that is not the question here.

Due to this suspicion, we wanted to see how bad the performance would be on our Intel processor if we disabled vectorization. This should give us some indication of how much vectorization is occurring, and how much improvement we might expect to see in ARM. However, when we change the compiler flags to

set(CMAKE_CXX_FLAGS "-Wall")

(or to the case where we use only AVX, without FMA), we get the same general solution from the optimizer, but with very different convergence behavior. Specifically, without vectorization, the optimizer takes about 500 iterations to converge to the solution.

So in summary:

With AVX and FMA      : 100 iterations to converge
With AVX              : 200 iterations to converge
Without AVX and FMA   : 500 iterations to converge

We are literally only changing that one line in the cmake file, not the source code.

I would like some suggestions for why this may be occurring.


My thoughts and more background info:

It seems to me that either the version with or without vectorization must be doing some rounding, and that is making IPOPT converge differently. I was under the impression that adding AVX and FMA flags would not change the output of the functions, but rather only the time it takes to compute them. I appear to be wrong.

The phenomenon we are observing appears particularly strange to me. On one hand, the optimizer always converges to the same solution, which suggests that the problem can't be too ill-conditioned. On the other hand, the fact that the optimizer behaves differently with and without vectorization flags suggests that the problem IS indeed sensitive to whatever small rounding differences the vectorized instructions introduce.

One other thing to keep in mind is that we precompiled IPOPT into a library and are simply linking our code against that precompiled library. So I don't think the AVX and FMA flags can be affecting the optimizer itself. That seems to mean that our functions must be producing tangibly different values depending on whether vectorization is enabled.


For those interested, here is the full cmake file

cmake_minimum_required(VERSION 3.5)

# If a build type is not passed to cmake, then use this...
if(NOT CMAKE_BUILD_TYPE)
    # set(CMAKE_BUILD_TYPE Release)
    set(CMAKE_BUILD_TYPE Debug)
endif()

# If you are debugging, generate symbols.
set(CMAKE_CXX_FLAGS_DEBUG "-g")

# If in release mode, use all possible optimizations
set(CMAKE_CXX_FLAGS_RELEASE "-O3")

# We need c++11
set(CMAKE_CXX_STANDARD 11)

# Show us all of the warnings and enable all vectorization options!!!
# I must be crazy because these vectorization flags seem to have no effect.
set(CMAKE_CXX_FLAGS "-Wall -mavx -mfma")

if (CMAKE_SYSTEM_NAME MATCHES "CYGWIN")
    include_directories(../../Eigen/
            /cygdrive/c/coin/windows/ipopt/include/coin/
            /cygdrive/c/coin/windows/ipopt/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS /cygdrive/c/coin/windows/ipopt/lib/)
else ()
    include_directories(../../Eigen/
            ../../coin/CoinIpopt/build/include/coin/
            ../../coin/CoinIpopt/build/include/coin/ThirdParty/)
    find_library(IPOPT_LIBRARY ipopt HINTS ../../coin/CoinIpopt/build/lib/)
endif ()

# Build the c++ functions into an executable
add_executable(trajectory_optimization main.cpp)

# Link all of the libraries together so that the C++-executable can call IPOPT
target_link_libraries(trajectory_optimization ${IPOPT_LIBRARY})

Solution

  • Enabling FMA will result in different rounding behavior, which can lead to very different results if your algorithm is not numerically stable. Also, enabling AVX in Eigen results in a different order of additions, and since floating-point math is non-associative, this can also lead to slightly different results.
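    As a minimal, self-contained illustration of the FMA rounding difference (the values below are chosen purely for demonstration and have nothing to do with your problem): a plain multiply-then-subtract rounds the product to a double first, while std::fma rounds only once at the end, so low-order bits that the separate multiply discards can survive.

    ```cpp
    #include <cmath>    // std::fma, std::ldexp
    #include <cstdio>

    int main() {
        // a = 1 + 2^-27 is exactly representable.
        // Exactly, a*a = 1 + 2^-26 + 2^-54, but the 2^-54 term is below
        // half an ulp of 1.0 and is rounded away when a*a is stored as a double.
        double a = 1.0 + std::ldexp(1.0, -27);
        double p = 1.0 + std::ldexp(1.0, -26);  // a*a after rounding

        double separate = a * a - p;            // product rounded first
        double fused    = std::fma(a, a, -p);   // single rounding: 2^-54 survives

        std::printf("separate = %g, fused = %g\n", separate, fused);
        return 0;
    }
    ```

    Note that with hardware FMA enabled, the compiler may contract the "separate" expression into an FMA as well (see -ffp-contract), which is exactly the kind of silent change in rounding your flags introduce.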

    To illustrate why non-associativity can make a difference: when adding 8 consecutive doubles a[8] with SSE3 or with AVX, Eigen will typically produce code equivalent to the following:

    // SSE: two partial sums, combined at the end
    double t[2] = {a[0], a[1]};
    for (int i = 2; i < 8; i += 2)
       t[0] += a[i], t[1] += a[i+1];  // addpd
    t[0] += t[1];                     // haddpd

    // AVX: four partial sums, combined at the end
    double t[4] = {a[0], a[1], a[2], a[3]};
    for (int j = 0; j < 4; ++j) t[j] += a[4+j]; // vaddpd
    t[0] += t[2]; t[1] += t[3];                 // vhaddpd
    t[0] += t[1];                               // vhaddpd
    

    Without more details it is hard to tell what exactly happens in your case.
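    To make the reduction orders above concrete, here is a small self-contained program (an illustration of the orderings sketched above, not Eigen's actual kernel) that sums the same eight doubles three ways. The array is contrived so that rounding makes all three results differ:

    ```cpp
    #include <cstdio>

    int main() {
        // One large value plus seven small ones; the ulp at 1e16 is 2.0,
        // so any single "+ 1.0" applied directly to 1e16 is rounded away.
        double a[8] = {1e16, 1, 1, 1, 1, 1, 1, 1};

        // Scalar: strict left-to-right summation. Every 1.0 is absorbed.
        double seq = a[0];
        for (int i = 1; i < 8; ++i) seq += a[i];               // 1e16

        // SSE-style: two partial sums, then one horizontal add.
        double t2[2] = {a[0], a[1]};
        for (int i = 2; i < 8; i += 2) { t2[0] += a[i]; t2[1] += a[i + 1]; }
        double sse = t2[0] + t2[1];                            // 1e16 + 4

        // AVX-style: four partial sums, then two rounds of horizontal adds.
        double t4[4] = {a[0], a[1], a[2], a[3]};
        for (int j = 0; j < 4; ++j) t4[j] += a[4 + j];
        double avx = (t4[0] + t4[2]) + (t4[1] + t4[3]);        // 1e16 + 6

        std::printf("seq = %.1f  sse = %.1f  avx = %.1f\n", seq, sse, avx);
        return 0;
    }
    ```

    Three different answers from the same mathematically exact sum, purely because the partial sums accumulate in a different order. Differences of this kind in a gradient or constraint Jacobian are enough to send IPOPT down a different iteration path.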