I want to write the following numpy einsum as a an Eigen Tensor op
import numpy as np
L = np.random.rand(2, 2, 136)
U = np.random.rand(2, 2, 136)
result = np.einsum('ijl,jkl->ikl', U, L)
I can write it with for loops like so in C++
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
for (int k = 0; k < 2; k++) {
for (int l = 0; l < 136; l++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
How do I write in eigen notation using its operations? Using for loops doesn't allow eigen to properly vectorize the operations, as I have complicated scalar types.
Edit.
As asked for, a Jet is an extension of dual numbers, where each element is a number, followed by an array of gradients of that number wrt some parameters. http://ceres-solver.org/automatic_derivatives.html
A naive implmentation might look like
template<typename T, int N>
struct Jet
{
T a;
T v[N];
};
If the jet is written using eigen ops, the idea is that using expression templates, eigen should vectorize all operations directly.
There is no contraction happening in the 3rd dimension "l
" in your case. So, in a sense, L
and U
are arrays of length 136 of 2x2 matrices, and you are multiplying the matrix U[l]
with L[l]
. I think doing something similar to np.einsum
with Eigen is therefore not possible; Eigen::Tensor::contract
only supports "real" contractions. But one can of course always do the loop over the 3rd dimension manually. But as shown below, this performs very badly.
Nevertheless, there are ways to speed things up and vectorize the loops, by either relying on automatic vectorization (did not work well for me) or by giving additional compiler hints (via OpenMP SIMD).
In the following, I define cDim12=2
as the size of the first and second dimension, and cDim13=136
as the third dimension.
For the timings, all code was compiled with -O3 -mavx
with gcc 11.2 and clang 15.0.2. I used google benchmark to get the timings on an Intel Core i7-4770K (yeah, quite a few years old, sorry). Eigen trunk (08c961e83) from 20th January 2023 was used.
TL;DR: To summarize the results below:
Note: Measure your real world application, since things might behave completely different there!
The straightforward code from your original post looks like this and is used as baseline.
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
result.setZero();
for (int i = 0; i < cDim12; i++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int l = 0; l < cDim3; l++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
Note that Eigen::Tensor
uses column major order by default (and row major is not recommended). Thus, in an expression such as U(i, j, l)
, the i
should be the fastest (most inner) loop and l
the slowest (most outer) loop. Reordering as best as I could:
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
This is 1.3x-1.4x faster.
Eigen::Tensor::chip
and contract
("EigenChipAndContract")Using Eigen features as much as possible, I came up with the following:
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (int l = 0; l < cDim3; l++) {
result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
}
This performs very bad: It is 18x slower on gcc and 24x slower on clang when compared to "FromOriginalPost".
Eigen::TensorMap
and contract
("EigenMapAndContract")The "EigenChipAndContract" might do a lot of copying, so another idea was to use Eigen::TensorMap
to get "references" to each necessary "slice" of data. For the raw array access, note again that Eigen uses column-major order.
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (int l = 0; l < cDim3; l++) {
Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
result_chip = U_chip.contract(L_chip, productDims);
}
This is actually somewhat faster than "EigenChipAndContract", but still very slow. Compared to "FromOriginalPost", this is 14x slower for gcc and 19x slower for clang.
Although both gcc and clang can do automatic vectorization, without additional hints they do not yield good results. However, both support the OpenMP pragma #pragma omp simd collapse(4)
, when compiled with -fopenmp
:
#pragma omp simd collapse(4)
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
Compilation with -O3 -mavx -fopenmp
results in
The previous code used the Eigen::Tensor::operator()
, which should inline to the raw array accesses. However, remembering the column-major layout, we can also access the underlying array directly and check whether this improves anything. It also allows to give the hint to the compiler again that the data is properly aligned (although Eigen already defines them as such).
double * pR = result.data();
double * pU = U.data();
double * pL = L.data();
#pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
}
}
}
}
Somewhat surprisingly, this is 1.1x faster for gcc and 1.4x faster for clang when compared with "EigenAccessWithOMP". When compared with the original "FromOriginalPost", it is 2.8x faster for gcc and 2.5x slower for clang.
When viewed on godbolt, gcc really produces some quite concise assembly.
Not sure how far fetched it is to compare the absolute execution time of a call to np.einsum
with the C++ version, since Python needs to do additional string parsing etc. Nevertheless, here is the code:
import numpy as np
import timeit
L = np.random.rand(2, 2, 136)
U = np.random.rand(2, 2, 136)
numIterations = 1000000
timing = timeit.timeit(lambda: np.einsum('ijl,jkl->ikl', U, L), number=numIterations)
print(f"np.einsum (per iteration): {timing.real/(numIterations*1e-9)}ns")
For Python 3.9 and numpy-1.24.1 this is roughly 6x times slower compared to "FromOriginalPost" and 16x times slower compared to "DirectAccessWithOMP" for gcc.
For gcc:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
FromOriginalPost 823 ns 823 ns 3397793
OptimizedOrder 573 ns 573 ns 4895246
DirectAccess 1306 ns 1306 ns 2142826
EigenAccessWithOMP 324 ns 324 ns 8655549
DirectAccessWithOMP 296 ns 296 ns 9418635
EigenChipAndContract 14405 ns 14405 ns 193548
EigenMapAndContract 11390 ns 11390 ns 243122
For clang:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
FromOriginalPost 753 ns 753 ns 3714543
OptimizedOrder 570 ns 570 ns 4921914
DirectAccess 569 ns 569 ns 4929755
EigenAccessWithOMP 2704 ns 2704 ns 1037819
DirectAccessWithOMP 1908 ns 1908 ns 1466390
EigenChipAndContract 17713 ns 17713 ns 157427
EigenMapAndContract 14064 ns 14064 ns 198875
Python:
np.einsum (per iteration): 4873.6035999991145 ns
Also on godbolt, however not really useful since the compiler times out there quite often.
Locally I compiled with -O3 -DNDEBUG -std=c++17 -mavx -fopenmp -Wall -Wextra
.
#include <iostream>
#include <iomanip>
#include <cmath>
#include <unsupported/Eigen/CXX11/Tensor>
#include <benchmark/benchmark.h>
//====================================================
// Globals
//====================================================
static constexpr int cDim12 = 2;
static constexpr int cDim3 = 136;
Eigen::Tensor<double, 3> CreateRandomTensor()
{
Eigen::Tensor<double, 3> m(cDim12, cDim12, cDim3);
m.setRandom();
return m;
}
Eigen::Tensor<double, 3> const L = CreateRandomTensor();
Eigen::Tensor<double, 3> const U = CreateRandomTensor();
//====================================================
// Helpers
//====================================================
Eigen::Tensor<double, 3> ReferenceResult()
{
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
result.setZero();
for (int i = 0; i < cDim12; i++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int l = 0; l < cDim3; l++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
return result;
}
void CheckResult(Eigen::Tensor<double, 3> const & result)
{
Eigen::Tensor<double, 3> const ref = ReferenceResult();
Eigen::Tensor<double, 3> const diff = ref - result;
Eigen::Tensor<double, 0> const max = diff.maximum();
Eigen::Tensor<double, 0> const min = diff.minimum();
double const maxDiff = std::max(std::abs(max(0)), std::abs(min(0)));
if (maxDiff > 1e-14) {
std::cerr << "ERROR! Max Diff = " << std::setprecision(17) << maxDiff << std::endl;
}
}
//====================================================
// Benchmarks
//====================================================
static void FromOriginalPost(benchmark::State& state)
{
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
for (int i = 0; i < cDim12; i++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int l = 0; l < cDim3; l++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(FromOriginalPost);
static void OptimizedOrder(benchmark::State& state)
{
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(OptimizedOrder);
static void DirectAccess(benchmark::State& state)
{
Eigen::Tensor<double, 3> U = ::U;
Eigen::Tensor<double, 3> L = ::L;
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
double * pR = result.data();
double * pU = U.data();
double * pL = L.data();
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(DirectAccess);
static void EigenAccessWithOMP(benchmark::State& state)
{
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
#pragma omp simd collapse(4)
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
result(i, k, l) += U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(EigenAccessWithOMP);
static void DirectAccessWithOMP(benchmark::State& state)
{
Eigen::Tensor<double, 3> U = ::U;
Eigen::Tensor<double, 3> L = ::L;
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
double * pR = result.data();
double * pU = U.data();
double * pL = L.data();
#pragma omp simd collapse(4) aligned(pR, pU, pL: 32) // 32: For AVX
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
pR[i + cDim12*(k + cDim12*l)] += pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(DirectAccessWithOMP);
static void EigenChipAndContract(benchmark::State& state)
{
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
for (int l = 0; l < cDim3; l++) {
result.chip(l, 2) = U.chip(l, 2).contract(L.chip(l, 2), productDims);
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(EigenChipAndContract);
static void EigenMapAndContract(benchmark::State& state)
{
Eigen::Tensor<double, 3> U = ::U;
Eigen::Tensor<double, 3> L = ::L;
Eigen::array<Eigen::IndexPair<int>, 1> productDims = {Eigen::IndexPair<int>(1, 0)};
Eigen::Tensor<double, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
result.setZero();
for (int l = 0; l < cDim3; l++) {
Eigen::TensorMap<Eigen::Tensor<double, 2>> U_chip(U.data() + l * cDim12 * cDim12, cDim12, cDim12);
Eigen::TensorMap<Eigen::Tensor<double, 2>> L_chip(L.data() + l * cDim12 * cDim12, cDim12, cDim12);
Eigen::TensorMap<Eigen::Tensor<double, 2>> result_chip(result.data() + l * cDim12 * cDim12, cDim12, cDim12);
result_chip = U_chip.contract(L_chip, productDims);
}
benchmark::DoNotOptimize(result.data());
}
CheckResult(result);
}
BENCHMARK(EigenMapAndContract);
BENCHMARK_MAIN();
After the original post was edited, the arithmetic types used are not really built-ins but rather jets. Eigen can be extended to support custom types (as briefly outlined here). However, the Eigen::Tensor::contract()
function nevertheless does not "magically" support the equivalent of np.einsum('ijl,jkl->ikl', U, L)
since the last dimension l
does not really perform a contraction. Of course, one could write one, but this seems far from trivial.
If the only required contraction-like operation is the one from the original post, and also the tensors are not further multiplied/added/etc, the simplest thing to do is to implement the single loop manually and play around with compilers, compiler settings, pragmas, etc to figure out the best performance.
Jet type (adapted from here):
template<int N> struct Jet {
double a = 0.0;
Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
};
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a + g.a, f.v + g.v};
}
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
}
For example (column-major)
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
SetToZero(result);
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> & r = result(i, k, l);
r = r + U(i, j, l) * L(j, k, l);
}
}
}
}
or with row-major order:
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
SetToZero(result);
for (int i = 0; i < cDim12; i++) {
for (int k = 0; k < cDim12; k++) {
for (int j = 0; j < cDim12; j++) {
for (int l = 0; l < cDim3; l++) {
Jet<N> & r = result(i, k, l);
r = r + U(i, j, l) * L(j, k, l);
}
}
}
}
gcc and clang yield the same performance. They auto-vectorize the column-major loops, but apparently not the row-major ones. Direct access of the underlying data does not improve things. Moreover, adding #pragma omp simd collapse(4)
results in worse performance in both cases (clang also warns that the loops could not be vectorized); I guess the explicit SIMDs used in the matrix multiplication of Jet::v
internally by Eigen are the reason.
As an additional note again: The Eigen documentation says that you shouldn't really combine row-major order with Eigen::Tensor
:
The tensor library supports 2 layouts: ColMajor (the default) and RowMajor. Only the default column major layout is currently fully supported, and it is therefore not recommended to attempt to use the row major layout at the moment.
Full code:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <unsupported/Eigen/CXX11/Tensor>
#include <benchmark/benchmark.h>
static constexpr int cDim12 = 2;
static constexpr int cDim3 = 136;
template<int N> struct Jet {
double a = 0.0;
Eigen::Matrix<double, 1, N> v = Eigen::Matrix<double, 1, N>::Zero();
};
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a + g.a, f.v + g.v};
}
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a - g.a, f.v - g.v};
}
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a * g.a, f.a * g.v + f.v * g.a};
}
template<int N>
EIGEN_STRONG_INLINE Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) {
return Jet<N>{f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a)};
}
static constexpr int N = 8;
template <Eigen::StorageOptions storage>
auto CreateRandomTensor()
{
Eigen::Tensor<Jet<N>, 3, storage> result(cDim12, cDim12, cDim3);
for (int l = 0; l < cDim3; l++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> jet;
jet.a = (double)rand() / RAND_MAX;
jet.v.setRandom();
result(i, k, l) = jet;
}
}
}
return result;
}
template <class T>
void SetToZero(T & result)
{
for (int l = 0; l < cDim3; l++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
result(i, k, l) = Jet<N>{};
}
}
}
}
static void EigenAccessNoOMP(benchmark::State& state)
{
srand(42);
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
SetToZero(result);
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> & r = result(i, k, l);
r = r + U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
}
BENCHMARK(EigenAccessNoOMP);
static void EigenAccessNoOMPRowMajor(benchmark::State& state)
{
srand(42);
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> L = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> U = CreateRandomTensor<Eigen::RowMajor>();
Eigen::Tensor<Jet<N>, 3, Eigen::RowMajor> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
SetToZero(result);
for (int i = 0; i < cDim12; i++) {
for (int k = 0; k < cDim12; k++) {
for (int j = 0; j < cDim12; j++) {
for (int l = 0; l < cDim3; l++) {
Jet<N> & r = result(i, k, l);
r = r + U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
}
BENCHMARK(EigenAccessNoOMPRowMajor);
static void DirectAccessNoOMP(benchmark::State& state)
{
srand(42);
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
SetToZero(result);
Jet<N> * pR = result.data();
Jet<N> * pU = U.data();
Jet<N> * pL = L.data();
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
}
BENCHMARK(DirectAccessNoOMP);
static void EigenAccessWithOMP(benchmark::State& state)
{
srand(42);
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
SetToZero(result);
#pragma omp simd collapse(4)
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> & r = result(i, k, l);
r = r + U(i, j, l) * L(j, k, l);
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
}
BENCHMARK(EigenAccessWithOMP);
static void DirectAccessWithOMP(benchmark::State& state)
{
srand(42);
Eigen::Tensor<Jet<N>, 3> L = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> U = CreateRandomTensor<Eigen::ColMajor>();
Eigen::Tensor<Jet<N>, 3> result(cDim12, cDim12, cDim3);
for (auto _ : state) {
SetToZero(result);
Jet<N> * pR = result.data();
Jet<N> * pU = U.data();
Jet<N> * pL = L.data();
#pragma omp simd collapse(4) aligned(pR, pU, pL: 32)
for (int l = 0; l < cDim3; l++) {
for (int j = 0; j < cDim12; j++) {
for (int k = 0; k < cDim12; k++) {
for (int i = 0; i < cDim12; i++) {
Jet<N> & r = pR[i + cDim12*(k + cDim12*l)];
r = r + pU[i + cDim12*(j + cDim12*l)] * pL[j + cDim12*(k + cDim12*l)];
}
}
}
}
benchmark::DoNotOptimize(result.data());
}
}
BENCHMARK(DirectAccessWithOMP);
BENCHMARK_MAIN();