c++ eigen matrix-multiplication tensor numpy-einsum

Write a numpy einsum operation as Eigen tensor ops


I want to write the following numpy einsum as an Eigen Tensor op:

import numpy as np

L = np.random.rand(2, 2, 136)
U = np.random.rand(2, 2, 136)

result = np.einsum('ijl,jkl->ikl', U, L)
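# i.e. for each l: result[:, :, l] = U[:, :, l] @ L[:, :, l]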

I can write it with for loops in C++ like so (assuming result is a zero-initialized Eigen::Tensor<double, 3> of shape (2, 2, 136)):

  for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 2; j++) {
      for (int k = 0; k < 2; k++) {
        for (int l = 0; l < 136; l++) {
          result(i, k, l) += U(i, j, l) * L(j, k, l);
        }
      }
    }
  }

How do I write this in Eigen notation using its operations? Using for loops doesn't allow Eigen to properly vectorize the operations, as I have complicated scalar types.

Edit.

As asked for: a Jet is an extension of dual numbers, where each element consists of a number followed by an array of gradients of that number with respect to some parameters. See http://ceres-solver.org/automatic_derivatives.html

A naive implementation might look like

template<typename T, int N>
struct Jet
{
    T a;
    T v[N];
};

If the Jet is written using Eigen ops, the idea is that, thanks to expression templates, Eigen should vectorize all operations directly.
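
An Eigen-backed version might, for example, look like the following sketch (just an illustration; a real implementation such as the Ceres Jet also needs an Eigen::NumTraits specialization and many more operators):

#include <Eigen/Core>

template<typename T, int N>
struct Jet
{
    T a;                       // value
    Eigen::Matrix<T, 1, N> v;  // gradients, stored as an Eigen row vector
};

// Multiplication via the product rule; the gradient part f.a * g.v + f.v * g.a
// is an Eigen expression, which Eigen can evaluate with SIMD instructions.
template<typename T, int N>
Jet<T, N> operator*(const Jet<T, N>& f, const Jet<T, N>& g)
{
    return Jet<T, N>{f.a * g.a, f.a * g.v + f.v * g.a};
}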


Solution

  • There is no contraction happening in the 3rd dimension "l" in your case. So, in a sense, L and U are arrays of length 136 of 2x2 matrices, and you are multiplying the matrix U[l] with L[l]. I think doing something equivalent to this np.einsum call directly with Eigen is therefore not possible; Eigen::Tensor::contract only supports "real" contractions. One can of course always loop over the 3rd dimension manually and contract each 2x2 slice, but as shown below this performs very badly.
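
    For reference, Eigen::Tensor::contract can do a genuine contraction over both shared indices j and l, i.e. the equivalent of np.einsum('ijl,jkl->ik', U, L), because those indices are summed away (sketch only, not what is asked for here):

    // Contract dim 1 of U (j) with dim 0 of L (j), and dim 2 of U (l) with dim 2 of L (l).
    // The contracted indices disappear; the result is a 2x2 matrix indexed by (i, k).
    Eigen::array<Eigen::IndexPair<int>, 2> dims = {Eigen::IndexPair<int>(1, 0),
                                                   Eigen::IndexPair<int>(2, 2)};
    Eigen::Tensor<double, 2> fullContraction = U.contract(L, dims);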

    Nevertheless, there are ways to speed things up and vectorize the loops, either by relying on automatic vectorization (which did not work well for me) or by giving the compiler additional hints (via OpenMP SIMD).

    In the following, I define cDim12=2 as the size of the first and second dimensions, and cDim3=136 as the size of the third dimension. For the timings, all code was compiled with -O3 -mavx with gcc 11.2 and clang 15.0.2. I used google benchmark to get the timings on an Intel Core i7-4770K (yeah, quite a few years old, sorry). Eigen trunk (08c961e83) from 20th January 2023 was used.

    TL;DR: To summarize the results below: with gcc, the OpenMP SIMD variants ("EigenAccessWithOMP", "DirectAccessWithOMP") are the fastest; with clang, the plain reordered loops ("OptimizedOrder", "DirectAccess") are the fastest and the OpenMP variants are even slower than the baseline. The chip/contract based versions are by far the slowest on both compilers.

    Note: Measure your real world application, since things might behave completely differently there!

    Code from the original post as baseline ("FromOriginalPost")

    The straightforward code from your original post looks like this and is used as the baseline.

    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    result.setZero();
    for (int i = 0; i < cDim12; i++) {
      for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
          for (int l = 0; l < cDim3; l++) {
            result(i, k, l) += U(i, j, l) * L(j, k, l);
          }
        }
      }
    }
    

    Optimized loop order ("OptimizedOrder")

    Note that Eigen::Tensor uses column-major order by default (and row-major is not recommended). Thus, in an expression such as U(i, j, l), i should be the fastest (innermost) loop and l the slowest (outermost) loop. Reordering as best I could:

    for (int l = 0; l < cDim3; l++) {
      for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            result(i, k, l) += U(i, j, l) * L(j, k, l);
          }
        }
      }
    }
    

    This is 1.3x-1.4x faster.

    Using Eigen::Tensor::chip and contract ("EigenChipAndContract")

    Using Eigen features as much as possible, I came up with the following:

    Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
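    // IndexPair(1, 0): contract dim 1 of U.chip(l, 2) with dim 0 of L.chip(l, 2), i.e. an ordinary 2x2 matrix product per slice l.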
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (int l = 0; l < cDim3; l++) {
      result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
    }
    

    This performs very badly: it is 18x slower with gcc and 24x slower with clang when compared to "FromOriginalPost".

    Using Eigen::TensorMap and contract ("EigenMapAndContract")

    The "EigenChipAndContract" might do a lot of copying, so another idea was to use Eigen::TensorMap to get "references" to each necessary "slice" of data. For the raw array access, note again that Eigen uses column-major order.

    Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
    Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
    for (int l = 0; l < cDim3; l++) {
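      // Each l-slice is a contiguous block of cDim12*cDim12 doubles (column-major), so its offset is l * cDim12 * cDim12.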
      Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
      Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
      Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
      result_chip = U_chip.contract(L_chip, productDims);
    }
    

    This is actually somewhat faster than "EigenChipAndContract", but still very slow. Compared to "FromOriginalPost", this is 14x slower for gcc and 19x slower for clang.

    Vectorization with OpenMP ("EigenAccessWithOMP")

    Although both gcc and clang can do automatic vectorization, without additional hints they do not yield good results. However, both support the OpenMP pragma #pragma omp simd collapse(4) when compiled with -fopenmp:

    #pragma omp simd collapse(4)
    for (int l = 0; l < cDim3; l++) {
      for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            result(i, k, l) += U(i, j, l) * L(j, k, l);
          }
        }
      }
    }
    

    Compilation with -O3 -mavx -fopenmp results in roughly a 2.5x speedup over "FromOriginalPost" for gcc, but roughly a 3.6x slowdown for clang (see the raw timings below).

    Vectorization with OpenMP + direct raw access ("DirectAccessWithOMP")

    The previous code used Eigen::Tensor::operator(), which should inline to the raw array accesses. However, remembering the column-major layout, we can also access the underlying array directly and check whether this improves anything. It also allows us to hint to the compiler once more that the data is properly aligned (although Eigen already declares it as such).

    double * pR = result.data();
    double * pU = U.data();
    double * pL = L.data();
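    // Column-major layout: result(i, k, l) is stored at pR[i + cDim12*(k + cDim12*l)]; pU and pL are indexed analogously.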
    
    #pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
    for (int l = 0; l < cDim3; l++) {
      for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
          }
        }
      }
    }
    

    Somewhat surprisingly, this is 1.1x faster for gcc and 1.4x faster for clang when compared with "EigenAccessWithOMP". When compared with the original "FromOriginalPost", it is 2.8x faster for gcc and 2.5x slower for clang.

    When viewed on godbolt, gcc really produces some quite concise assembly.

    Python

    Not sure how far-fetched it is to compare the absolute execution time of a call to np.einsum with the C++ version, since Python needs to do additional string parsing etc. Nevertheless, here is the code:

    import numpy as np
    import timeit
    
    L = np.random.rand(2, 2, 136)
    U = np.random.rand(2, 2, 136)
    
    numIterations = 1000000
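    # timeit.timeit returns the total time in seconds; below it is converted to nanoseconds per iteration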
    timing = timeit.timeit(lambda: np.einsum('ijl,jkl->ikl', U, L), number=numIterations)
    print(f"np.einsum (per iteration): {timing.real/(numIterations*1e-9)}ns")
    

    For Python 3.9 and numpy-1.24.1, this is roughly 6x slower than "FromOriginalPost" and 16x slower than "DirectAccessWithOMP" for gcc.

    Raw timings

    For gcc:

    ---------------------------------------------------------------
    Benchmark                     Time             CPU   Iterations
    ---------------------------------------------------------------
    FromOriginalPost            823 ns          823 ns      3397793
    OptimizedOrder              573 ns          573 ns      4895246
    DirectAccess               1306 ns         1306 ns      2142826
    EigenAccessWithOMP          324 ns          324 ns      8655549
    DirectAccessWithOMP         296 ns          296 ns      9418635
    EigenChipAndContract      14405 ns        14405 ns       193548
    EigenMapAndContract       11390 ns        11390 ns       243122
    

    For clang:

    ---------------------------------------------------------------
    Benchmark                     Time             CPU   Iterations
    ---------------------------------------------------------------
    FromOriginalPost            753 ns          753 ns      3714543
    OptimizedOrder              570 ns          570 ns      4921914
    DirectAccess                569 ns          569 ns      4929755
    EigenAccessWithOMP         2704 ns         2704 ns      1037819
    DirectAccessWithOMP        1908 ns         1908 ns      1466390
    EigenChipAndContract      17713 ns        17713 ns       157427
    EigenMapAndContract       14064 ns        14064 ns       198875
    

    Python:

    np.einsum (per iteration): 4873.6035999991145 ns
    

    Full code

    The full code is also on godbolt; however, it is not really useful there since the compiler times out quite often. Locally, I compiled with -O3 -DNDEBUG -std=c++17 -mavx -fopenmp -Wall -Wextra.

    #include <iostream>
    #include <iomanip>
    #include <cmath>
    
    #include <unsupported/Eigen/CXX11/Tensor>
    #include <benchmark/benchmark.h>
    
    //====================================================
    // Globals
    //====================================================
    
    static constexpr int cDim12 = 2;
    static constexpr int cDim3 = 136;
    
    
    Eigen::Tensor<double, 3> CreateRandomTensor()
    {
      Eigen::Tensor<double, 3> m(cDim12, cDim12, cDim3);
      m.setRandom();
      return m;
    }
    
    
    Eigen::Tensor<double, 3> const L = CreateRandomTensor();
    Eigen::Tensor<double, 3> const U = CreateRandomTensor();
    
    
    //====================================================
    // Helpers
    //====================================================
    
    Eigen::Tensor<double, 3> ReferenceResult() 
    {
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      result.setZero();
      for (int i = 0; i < cDim12; i++) {
        for (int j = 0; j < cDim12; j++) {
          for (int k = 0; k < cDim12; k++) {
            for (int l = 0; l < cDim3; l++) {
              result(i, k, l) += U(i, j, l) * L(j, k, l);
            }
          }
        }
      }
    
      return result;
    }
    
    
    
    void CheckResult(Eigen::Tensor<double, 3> const & result) 
    {
      Eigen::Tensor<double, 3> const ref = ReferenceResult();
      Eigen::Tensor<double, 3> const diff = ref - result;
      Eigen::Tensor<double, 0> const max = diff.maximum();
      Eigen::Tensor<double, 0> const min = diff.minimum();
      double const maxDiff = std::max(std::abs(max(0)), std::abs(min(0)));
      if (maxDiff > 1e-14) {
        std::cerr << "ERROR! Max Diff = " << std::setprecision(17) << maxDiff << std::endl;
      }
    }
    
    
    //====================================================
    // Benchmarks
    //====================================================
    
    
    static void FromOriginalPost(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        for (int i = 0; i < cDim12; i++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int l = 0; l < cDim3; l++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(FromOriginalPost);
    
    
    static void OptimizedOrder(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(OptimizedOrder);
    
    
    static void DirectAccess(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> U = ::U;
      Eigen::Tensor<double, 3> L = ::L;
    
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        double * pR = result.data();
        double * pU = U.data();
        double * pL = L.data();
    
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(DirectAccess);
    
    
    static void EigenAccessWithOMP(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        #pragma omp simd collapse(4)
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                result(i, k, l) += U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(EigenAccessWithOMP);
    
    
    static void DirectAccessWithOMP(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> U = ::U;
      Eigen::Tensor<double, 3> L = ::L;
    
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        double * pR = result.data();
        double * pU = U.data();
        double * pL = L.data();
    
        #pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(DirectAccessWithOMP);
    
    
    static void EigenChipAndContract(benchmark::State& state) 
    {
      Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
          result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
        }
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(EigenChipAndContract);
    
    
    static void EigenMapAndContract(benchmark::State& state) 
    {
      Eigen::Tensor<double, 3> U = ::U;
      Eigen::Tensor<double, 3> L = ::L;
      Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
    
      Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        result.setZero();
        for (int l = 0; l < cDim3; l++) {
          Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
          Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
          Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
          result_chip = U_chip.contract(L_chip, productDims);
        }
        benchmark::DoNotOptimize(result.data());
      }
    
      CheckResult(result);
    }
    BENCHMARK(EigenMapAndContract);
    
    
    BENCHMARK_MAIN();
    


    EDIT for jets

    After the edit to the original post, the arithmetic types used are not really built-ins but rather jets. Eigen can be extended to support custom types (as briefly outlined here). However, the Eigen::Tensor::contract() function nevertheless does not "magically" support the equivalent of np.einsum('ijl,jkl->ikl', U, L), since the last dimension l does not actually take part in a contraction. Of course, one could write such an operation oneself, but this seems far from trivial.
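
    As a rough sketch of what such an extension involves (following the custom scalar type pattern from the Eigen documentation; the cost values below are only placeholders), one specializes Eigen::NumTraits for the Jet type:

    namespace Eigen {
      // Sketch only: tells Eigen how to treat Jet<N> as a scalar type.
      template<int N>
      struct NumTraits<Jet<N>> : NumTraits<double> {
        typedef Jet<N> Real;
        typedef Jet<N> NonInteger;
        typedef Jet<N> Nested;
        enum {
          IsComplex = 0,
          IsInteger = 0,
          IsSigned = 1,
          RequireInitialization = 1,
          ReadCost = 1,
          AddCost = 3,
          MulCost = 3
        };
      };
    }  // namespace Eigen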

    If the only required contraction-like operation is the one from the original post, and the tensors are not further multiplied/added/etc., the simplest thing to do is to implement the loop manually and play around with compilers, compiler settings and pragmas to figure out the best performance.

    Jet type (adapted from here):

    template<int N> struct Jet {
      double a = 0.0;
      Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
    };
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
      return Jet<N>{f.a + g.a, f.v + g.v};
    }
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
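      // Product rule: (f * g)' = f' * g + f * g', applied to the gradient part v.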
      return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
    }
    

    For example (column-major):

    Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
    Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    
    Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
    SetToZero(result);
    
    for (int l = 0; l < cDim3; l++) {
      for (int j = 0; j < cDim12; j++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            Jet<N> & r = result(i, k, l);
            r = r + U(i, j, l) * L(j, k, l);
          }
        }
      }
    }
    

    or with row-major order:

    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
    
    Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
    SetToZero(result);
    
    for (int i = 0; i < cDim12; i++) {
      for (int k = 0; k < cDim12; k++) {
        for (int j = 0; j < cDim12; j++) {
          for (int l = 0; l < cDim3; l++) {
            Jet<N> & r = result(i, k, l);
            r = r + U(i, j, l) * L(j, k, l);
          }
        }
      }
    }
    

    gcc and clang yield the same performance. They auto-vectorize the column-major loops, but apparently not the row-major ones. Direct access of the underlying data does not improve things. Moreover, adding #pragma omp simd collapse(4) results in worse performance in both cases (clang also warns that the loops could not be vectorized); I guess the explicit SIMD instructions that Eigen already uses internally for the arithmetic on Jet::v are the reason.

    As an additional note (already hinted at above): the Eigen documentation says that you shouldn't really combine row-major order with Eigen::Tensor:

    The tensor library supports 2 layouts: ColMajor (the default) and RowMajor. Only the default column major layout is currently fully supported, and it is therefore not recommended to attempt to use the row major layout at the moment.

    Full code:

    #include <iostream>
    #include <iomanip>
    #include <cmath>
    
    #include <unsupported/Eigen/CXX11/Tensor>
    #include <benchmark/benchmark.h>
    
    
    static constexpr int cDim12 = 2;
    static constexpr int cDim3 = 136;
    
    
    template<int N> struct Jet {
      double a = 0.0;
      Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
    };
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
      return Jet<N>{f.a + g.a, f.v + g.v};
    }
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) {
      return Jet<N>{f.a - g.a, f.v - g.v};
    }
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
      return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
    }
    
    template<int N> 
    EIGEN_STRONG_INLINE Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) {
      return Jet<N>{f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a)};
    }
    
    
    
    static constexpr int N = 8;
    
    
    template <Eigen::StorageOptions storage>
    auto CreateRandomTensor()
    {
      Eigen::Tensor<Jet<N>, 3, storage> result(cDim12, cDim12, cDim3);
      for (int l = 0; l < cDim3; l++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            Jet<N> jet;
            jet.a = (double)rand() / RAND_MAX;
            jet.v.setRandom();
            result(i, k, l) = jet;
          }
        }
      }
      return result;
    }
    
    
    template <class T>
    void SetToZero(T & result)
    {
      for (int l = 0; l < cDim3; l++) {
        for (int k = 0; k < cDim12; k++) {
          for (int i = 0; i < cDim12; i++) {
            result(i, k, l) = Jet<N>{};
          }
        }
      }
    }
    
    
    static void EigenAccessNoOMP(benchmark::State& state) 
    {
      srand(42);
      Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
      Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    
      Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        SetToZero(result);
    
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                Jet<N> & r = result(i, k, l);
                r = r + U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    }
    BENCHMARK(EigenAccessNoOMP);
    
    
    static void EigenAccessNoOMPRowMajor(benchmark::State& state) 
    {
      srand(42);
      Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
      Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
    
      Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        SetToZero(result);
    
        for (int i = 0; i < cDim12; i++) {
          for (int k = 0; k < cDim12; k++) {
            for (int j = 0; j < cDim12; j++) {
              for (int l = 0; l < cDim3; l++) {
                Jet<N> & r = result(i, k, l);
                r = r + U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    }
    BENCHMARK(EigenAccessNoOMPRowMajor);
    
    
    
    static void DirectAccessNoOMP(benchmark::State& state) 
    {
      srand(42);
      Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
      Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    
      Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        SetToZero(result);
    
        Jet<N> * pR = result.data();
        Jet<N> * pU = U.data();
        Jet<N> * pL = L.data();
    
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
                r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    }
    BENCHMARK(DirectAccessNoOMP);
    
    
    static void EigenAccessWithOMP(benchmark::State& state) 
    {
      srand(42);
      Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
      Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    
      Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        SetToZero(result);
    
        #pragma omp simd collapse(4)
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                Jet<N> & r = result(i, k, l);
                r = r + U(i, j, l) * L(j, k, l);
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    }
    BENCHMARK(EigenAccessWithOMP);
    
    
    
    static void DirectAccessWithOMP(benchmark::State& state) 
    {
      srand(42);
      Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
      Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
    
      Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
      for (auto _ : state) {
        SetToZero(result);
    
        Jet<N> * pR = result.data();
        Jet<N> * pU = U.data();
        Jet<N> * pL = L.data();
    
        #pragma omp simd collapse(4) aligned(pR, pU, pL: 32)
        for (int l = 0; l < cDim3; l++) {
          for (int j = 0; j < cDim12; j++) {
            for (int k = 0; k < cDim12; k++) {
              for (int i = 0; i < cDim12; i++) {
                Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
                r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
              }
            }
          }
        }
    
        benchmark::DoNotOptimize(result.data());
      }
    }
    BENCHMARK(DirectAccessWithOMP);
    
    
    
    BENCHMARK_MAIN();