c++, performance, open-source, eigen

Modification to Eigen library source code somehow degrades performance


One very common operation in my program is taking a vector and producing a vector in the same direction whose magnitude is the inverse of the original's squared magnitude (v → v̂/|v|², i.e. v/|v|³). I'm using the Eigen library for handling vectors, since it's supposed to be quite fast.

So far I've computed this vector like so:

Vector2d directedInverseSquare(const Vector2d &diff) {
    return diff / (diff.norm() * diff.squaredNorm());
}

This has worked fine, but it's not very efficient, since it computes norm() and squaredNorm() separately (to my knowledge).

So I've been thinking for a while about modifying the Eigen source code to add a new type of norm that avoids this double computation.

Important note: caching squaredNorm() and computing std::sqrt(sNorm) * sNorm always comes up short when compared to the implementation above.
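
For reference, a minimal sketch of what that cached attempt looks like (the function name here is illustrative, not the exact code):

#include <cmath>
#include <Eigen/Dense>

using Eigen::Vector2d;

// Illustrative name; caches squaredNorm() and rebuilds |v|^3 from it.
Vector2d directedInverseSquareCached(const Vector2d &diff) {
    const double sNorm = diff.squaredNorm();   // |v|^2, computed once
    return diff / (std::sqrt(sNorm) * sNorm);  // |v| * |v|^2 = |v|^3
}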

I'm nowhere near good enough at C++ to really understand how the Eigen source code works, but after looking through it, I found what I'm fairly confident is the place to implement my norm.

In the Dot.h file of Eigen/src/Core, I found this definition:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const typename MatrixBase<Derived>::PlainObject
MatrixBase<Derived>::normalized() const
{
  typedef typename internal::nested_eval<Derived,2>::type _Nested;
  _Nested n(derived());
  RealScalar z = n.squaredNorm();
  // NOTE: after extensive benchmarking, this conditional does not impact performance, at least on recent x86 CPU
  if(z>RealScalar(0))
    return n / numext::sqrt(z);
  else
    return n;
}

I copied it, and modified the line return n / numext::sqrt(z) to n / (z*numext::sqrt(z)). I also changed the name to newtonianNormalized (owing to its origins) and added a declaration in an earlier part of the file. I did the same with the norm's definition:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE typename NumTraits<typename internal::traits<Derived>::Scalar>::Real MatrixBase<Derived>::norm() const
{
  return numext::sqrt(squaredNorm());
}

For which I added the line RealScalar squared = squaredNorm();, and changed the original line to return numext::sqrt(squared)*squared;.

In both instances, the change for the Newtonian norm adds only one extra multiplication, which is a light cost. No other new computations are performed.

After benchmarking, the results are very strange. My benchmark runs large loops of the same operation over and over and sums up all the results, to keep the compiler from optimizing the loop away entirely. I use v.setRandom() to get random vectors, and I measured over ~20 runs of the program.
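
For concreteness, here is a minimal sketch of the shape of that benchmark (illustrative, not the exact harness; the pre-generated array of random inputs is an assumption):

#include <chrono>
#include <iostream>
#include <vector>
#include <Eigen/Dense>

using Eigen::Vector2d;

Vector2d directedInverseSquare(const Vector2d &diff) {
    return diff / (diff.norm() * diff.squaredNorm());
}

int main() {
    // Pre-generate random inputs so the timed loop measures only the math.
    // The size is a power of two so the index mask below works.
    std::vector<Vector2d> inputs(1 << 16);
    for (auto &v : inputs) v.setRandom();

    Vector2d sum = Vector2d::Zero();
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        sum += directedInverseSquare(inputs[i & (inputs.size() - 1)]);
    const auto end = std::chrono::steady_clock::now();

    // Print the sum so the compiler cannot discard the loop entirely.
    std::cout << sum.transpose() << "\n"
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
}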

The results, presented with very basic bar charts:

[Chart: Newtonian normed vector calculation timings]

The above chart compares different methods of calculating the "normed" vectors. Times are in milliseconds, over a loop of 100,000,000 iterations.

[Chart: norm calculation timings]

The above chart compares different methods of calculating the "norm". Times are also in milliseconds, over a loop of 100,000,000 iterations.

Note that the charts' axes start nowhere near 0, so it's easy to miss at first glance that the results barely differ at all.

I really expected a major performance boost: I cut the number of calculations roughly in half and reuse the already-computed quantities. How could the timings possibly be essentially the same?

Edit: I compile under MSYS2 MinGW64 with GCC 13.2.0, using CLion and CMake, with the following command:

C:\msys64\mingw64\bin\c++.exe -IC:/Users/.../MINREP/include/eigen-3.4.0 -isystem C:/Users/.../MINREP/include/SFML-2.6.1/include -O3 -DNDEBUG -march=native -std=gnu++23 -fdiagnostics-color=always -MD -MT CMakeFiles/MINREP.dir/main.cpp.obj -MF CMakeFiles\MINREP.dir\main.cpp.obj.d -o CMakeFiles/MINREP.dir/main.cpp.obj -c C:/Users/.../MINREP/main.cpp

I record the time before and after each loop with the <chrono> library. Please tell me if any other info is needed.

Edit: Here are both functions after modification:

template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE typename NumTraits<typename internal::traits<Derived>::Scalar>::Real MatrixBase<Derived>::newtonianNorm() const
{
    RealScalar squared = squaredNorm();
    return numext::sqrt(squared)*squared;
}
template<typename Derived>
EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const typename MatrixBase<Derived>::PlainObject
MatrixBase<Derived>::newtonianNormalized() const
{
    typedef typename internal::nested_eval<Derived,2>::type _Nested;
    _Nested n(derived());
    RealScalar z = n.squaredNorm();
    // NOTE: after extensive benchmarking, this conditional does not impact performance, at least on recent x86 CPU
    if(z>RealScalar(0))
        return n / (z*numext::sqrt(z));
    else
        return n;
}

Solution

  • This answer explains why the initial code (directedInverseSquare) is already well optimized, and hence why the alternative implementations barely affect performance in practice on your target machine. It also provides a way to improve performance.


    The function directedInverseSquare alone results in the following assembly code (when it is not inlined):

            vmovapd xmm2, XMMWORD PTR [rsi]    ; load diff (x and y) into xmm2
            mov     rax, rdi                   ; copy the return-slot pointer (result returned via memory)
            vmulpd  xmm0, xmm2, xmm2           ; [x*x, y*y] with one packed multiply
            vunpckhpd       xmm1, xmm0, xmm0   ; bring y*y down to the low lane
            vaddsd  xmm1, xmm0, xmm1           ; x*x + y*y = squaredNorm (scalar)
            vmovq   xmm0, xmm1                 ; copy squaredNorm
            vsqrtsd xmm0, xmm0, xmm0           ; sqrt(squaredNorm) = norm (scalar)
            vmulsd  xmm0, xmm0, xmm1           ; norm * squaredNorm = |v|^3 (scalar)
            vmovddup        xmm0, xmm0         ; broadcast |v|^3 to both lanes
            vdivpd  xmm2, xmm2, xmm0           ; diff / |v|^3 (packed divide)
            vmovapd XMMWORD PTR [rdi], xmm2    ; store the result
            ret
    

    This code basically loads the input vector from memory into a register, squares the two components (with a single packed SIMD instruction), shuffles them so as to add the two squared values, then computes the scalar square root, multiplies the result to get diff.norm() * diff.squaredNorm() (i.e. diff.norm() ** 3), and finally performs the division before storing the result back to memory. Almost all of this is done in a scalar way (only the initial squaring and the final division are packed). One can also see that squaredNorm() and the square root are each computed only once: the compiler inlines the functions and then detects common sub-expressions (see Common sub-expression elimination).

    If this function is called in a loop and inlined, the situation is not much better (see on godbolt). It is hard for the compiler to generate faster code since the input layout is not SIMD-friendly. A more aggressive optimizer could in principle shuffle the inputs so as to compute several square roots and perform several divisions at once with SIMD instructions. In practice, GCC does not seem able to do that.

    On your Intel Tiger Lake CPU, the loop is clearly limited by the execution port responsible for both the scalar division and the scalar square root, and apparently by that alone (assuming the loop is big enough and not memory-bound). This means all the other instructions have (nearly) no overhead, thanks to super-scalar execution. Since every alternative still has to compute the scalar division and the scalar square root, this explains why you do not see much of a difference (what remains is likely just execution noise, cache effects, or a slight variation in the tiny overhead of the other instructions).

    The main way to make this computation faster is to compute the square root and the division with SIMD instructions. To do that, you can either write the SIMD code manually (often cumbersome and not portable, unless you rely on external libraries), or help the compiler auto-vectorize the loop by storing the components more contiguously: the first component of every vector goes in one array and the second component in another (a structure-of-arrays layout). The compiler can then read the components with packed SIMD loads and use SIMD instructions for the rest of the computation. Your CPU can perform twice as many divisions and twice as many square roots per cycle with SIMD instructions, so the resulting code can be up to twice as fast (and should be close to that in practice).
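
    As an illustration, here is a minimal structure-of-arrays sketch of that layout (the names and separate output arrays are assumptions, not your current code); compiled with -O3 -march=native, GCC can apply packed loads, square roots and divisions to such a loop:

            #include <cmath>
            #include <cstddef>
            #include <vector>

            // Structure-of-arrays: x components in one array, y components in another.
            void directedInverseSquareSoA(const std::vector<double> &x,
                                          const std::vector<double> &y,
                                          std::vector<double> &outX,
                                          std::vector<double> &outY)
            {
                const std::size_t n = x.size();
                for (std::size_t i = 0; i < n; ++i) {
                    const double sq  = x[i] * x[i] + y[i] * y[i];  // squaredNorm
                    const double inv = 1.0 / (std::sqrt(sq) * sq); // 1 / |v|^3
                    outX[i] = x[i] * inv;
                    outY[i] = y[i] * inv;
                }
            }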

    If your input data is huge, then please note that you can also use multiple threads, as sketched below.
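
    For instance, assuming the structure-of-arrays loop sketched above, a single OpenMP directive (compiled with -fopenmp) is enough to split the iterations across cores; this is a sketch, not a tuned implementation:

            // Hypothetical parallel variant of the SoA loop above (requires -fopenmp).
            #pragma omp parallel for
            for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i) {
                const double sq  = x[i] * x[i] + y[i] * y[i];
                const double inv = 1.0 / (std::sqrt(sq) * sq);
                outX[i] = x[i] * inv;
                outY[i] = y[i] * inv;
            }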

    PS: the Eigen version used for the test is 3.4.0.