Tags: c++, stl, algorithm, data-oriented-design

Understanding std::transform and how to beat it


I am trying to understand data-oriented design on a simple, specific problem. Apologies in advance to data-oriented design people if I am doing something very stupid, but I am having a hard time understanding why and where my reasoning fails.

Assume that I have a simple operation, i.e., float_t result = float_t(lhs) / rhs, where lhs and rhs are of type int_t. If I keep all of the variables in their corresponding containers, e.g., std::vector<float_t> and std::vector<int_t>, and I use std::transform, I get the correct result. Then, for the specific case of using float_t = float and using int_t = int16_t, I expect that packing these variables inside a struct on a 64-bit architecture, and collecting them inside a container, should yield better performance.

I reason that the struct makes up a 64-bit object, and a single memory access to the struct will give me all the variables I need. On the other hand, when all these variables are collected in different containers, I will need three different memory accesses to get the information needed. Below is how I set up the environment:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  constexpr uint32_t M{100};
  for (auto N : {1000, 10000, 100000}) {
    double t1{0}, t2{0};
    for (uint32_t m = 0; m < M; m++) {
      std::vector<my_float> sinvl(N, 0.0);
      std::vector<my_int> s(N, 3), l(N, 2);
      std::vector<Packed<my_float, my_int>> p1(
          N, Packed<my_float, my_int>(0.0, 3, 2));

      // benchmark unpacked
      auto tstart = high_resolution_clock::now();
      std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                     std::divides<my_float>{}); // 3 different memory accesses
      auto tend = high_resolution_clock::now();
      t1 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "sinvl[0]: " << sinvl[0] << '\n';

      // benchmark packed
      tstart = high_resolution_clock::now();
      for (auto &elem : p1) // 1 memory access
        elem.comp();
      tend = high_resolution_clock::now();
      t2 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
    }
    std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
    std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
  }
  return 0;
}

Compiled with g++ -O3, the code yields, on my machine,

sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.

Basically, std::transform beats the packed access by roughly 2.5x. I would appreciate it if you could help me understand this behaviour. Is the result due to

  1. me not grasping the data-oriented design principles correctly, or,
  2. some artefact of this very simple example such as the memory locations getting allocated very close to each other and in some way getting optimized very efficiently by the compiler?

Finally, is there a way to beat std::transform in this example, or is it simply good enough to be a go-to solution? I am an expert neither in compiler optimizations nor in data-oriented design, and thus I could not answer this question myself.

Thanks!

EDIT. I have changed the way I test both methods, as per @bolov's suggestion in the comments.

Now the code looks like:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string> // for std::stoul
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  uint32_t N{1000};
  double t{0};

  if (argc == 2)
    N = std::stoul(argv[1]);

#ifndef _M_PACKED
  std::vector<my_float> sinvl(N, 0.0);
  std::vector<my_int> s(N, 3), l(N, 2);

  // benchmark unpacked
  auto tstart = high_resolution_clock::now();
  std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                 std::divides<my_float>{}); // 3 different memory accesses
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "sinvl[0]: " << sinvl[0] << '\n';
  std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
  std::vector<Packed<my_float, my_int>> p1(N,
                                           Packed<my_float, my_int>(0.0, 3, 2));
  // benchmark packed
  auto tstart = high_resolution_clock::now();
  for (auto &elem : p1) // 1 memory access
    elem.comp();
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
  std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif

  return 0;
}

with the corresponding shell (fish) script

g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
  echo "Testing unpacked for N = $N"
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  echo "Testing packed for N = $N"
  ./transform_packed.out $N
  ./transform_packed.out $N
  ./transform_packed.out $N
end

which gives the following:

Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.

I hope I have applied the proper testing methodology correctly. Still, though, the difference is two- to three-fold.


Solution

  • Here's the compiled loop of the std::transform case:

      400fd0:       f3 41 0f 7e 04 47       movq   xmm0,QWORD PTR [r15+rax*2]
      400fd6:       66 0f 61 c0             punpcklwd xmm0,xmm0
      400fda:       66 0f 72 e0 10          psrad  xmm0,0x10
      400fdf:       0f 5b c0                cvtdq2ps xmm0,xmm0
      400fe2:       f3 0f 7e 0c 43          movq   xmm1,QWORD PTR [rbx+rax*2]
      400fe7:       66 0f 61 c9             punpcklwd xmm1,xmm1
      400feb:       66 0f 72 e1 10          psrad  xmm1,0x10
      400ff0:       0f 5b c9                cvtdq2ps xmm1,xmm1
      400ff3:       0f 5e c1                divps  xmm0,xmm1
      400ff6:       41 0f 11 04 80          movups XMMWORD PTR [r8+rax*4],xmm0
      400ffb:       f3 41 0f 7e 44 47 08    movq   xmm0,QWORD PTR [r15+rax*2+0x8]
      401002:       66 0f 61 c0             punpcklwd xmm0,xmm0
      401006:       66 0f 72 e0 10          psrad  xmm0,0x10
      40100b:       0f 5b c0                cvtdq2ps xmm0,xmm0
      40100e:       f3 0f 7e 4c 43 08       movq   xmm1,QWORD PTR [rbx+rax*2+0x8]
      401014:       66 0f 61 c9             punpcklwd xmm1,xmm1
      401018:       66 0f 72 e1 10          psrad  xmm1,0x10
      40101d:       0f 5b c9                cvtdq2ps xmm1,xmm1
      401020:       0f 5e c1                divps  xmm0,xmm1
      401023:       41 0f 11 44 80 10       movups XMMWORD PTR [r8+rax*4+0x10],xmm0
      401029:       48 83 c0 08             add    rax,0x8
      40102d:       48 83 c1 02             add    rcx,0x2
      401031:       75 9d                   jne    400fd0 <main+0x570>
    

In each loop iteration, it processes 8 elements (there are two divps instructions, each performing 4 divisions).

    Here's the other case:

      401190:       f3 0f 6f 42 04          movdqu xmm0,XMMWORD PTR [rdx+0x4]
      401195:       f3 0f 6f 4a 14          movdqu xmm1,XMMWORD PTR [rdx+0x14]
      40119a:       66 0f 70 c9 e8          pshufd xmm1,xmm1,0xe8
      40119f:       66 0f 70 c0 e8          pshufd xmm0,xmm0,0xe8
      4011a4:       f2 0f 70 d0 e8          pshuflw xmm2,xmm0,0xe8
      4011a9:       66 0f 6c c1             punpcklqdq xmm0,xmm1
      4011ad:       66 0f 72 e0 10          psrad  xmm0,0x10
      4011b2:       0f 5b c0                cvtdq2ps xmm0,xmm0
      4011b5:       f2 0f 70 c9 e8          pshuflw xmm1,xmm1,0xe8
      4011ba:       66 0f 62 d1             punpckldq xmm2,xmm1
      4011be:       66 0f 61 ca             punpcklwd xmm1,xmm2
      4011c2:       66 0f 72 e1 10          psrad  xmm1,0x10
      4011c7:       0f 5b c9                cvtdq2ps xmm1,xmm1
      4011ca:       0f 5e c1                divps  xmm0,xmm1
      4011cd:       f3 0f 11 02             movss  DWORD PTR [rdx],xmm0
      4011d1:       0f 28 c8                movaps xmm1,xmm0
      4011d4:       0f c6 c9 e5             shufps xmm1,xmm1,0xe5
      4011d8:       f3 0f 11 4a 08          movss  DWORD PTR [rdx+0x8],xmm1
      4011dd:       0f 28 c8                movaps xmm1,xmm0
      4011e0:       0f 12 c9                movhlps xmm1,xmm1
      4011e3:       f3 0f 11 4a 10          movss  DWORD PTR [rdx+0x10],xmm1
      4011e8:       0f c6 c0 e7             shufps xmm0,xmm0,0xe7
      4011ec:       f3 0f 11 42 18          movss  DWORD PTR [rdx+0x18],xmm0
      4011f1:       48 83 c2 20             add    rdx,0x20
      4011f5:       48 83 c1 fc             add    rcx,0xfffffffffffffffc
      4011f9:       75 95                   jne    401190 <main+0x730>
    

In each loop iteration, it processes only 4 elements (there is a single divps instruction).

In the first case, the data is in a SIMD-friendly format: the instructions can operate on it (almost) without any data movement, and the results can be written back easily (4 results stored with a single movups).

In the second case, however, it is not. The compiler has to emit a lot of data-movement (shuffle) operations, and each result is written back with a separate movss instruction. The input/output is simply not in a SIMD-friendly format.

I don't have the time to analyse this issue further, but note that both snippets are similar in size and use similar instructions, yet the first processes twice as many elements per iteration as the second. That alone gives a good idea of why the second is roughly twice as slow.