I am trying to understand data-oriented design on a simple, specific problem. Apologies in advance to data-oriented design people if I am doing something very stupid, but I am having a hard time understanding where and why my reasoning fails.
Assume that I have a simple operation, say float_t result = float_t(lhs) / rhs, where lhs and rhs are of type int_t. If I keep all of the variables in their corresponding containers, e.g., std::vector<float_t> and std::vector<int_t>, and I use std::transform, I get the correct result. Then, for the specific case of using float_t = float and using int_t = int16_t, I assume that packing these variables inside a struct on a 64-bit architecture, and collecting them inside a container, should yield better performance.
I reason that the struct makes up a 64-bit object, so a single memory access to the struct gives me all the variables I need. When the variables are kept in separate containers, on the other hand, I need three different memory accesses per element. Below is how I set up the environment:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional> // for std::divides
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  constexpr uint32_t M{100};
  for (auto N : {1000, 10000, 100000}) {
    double t1{0}, t2{0};
    for (uint32_t m = 0; m < M; m++) {
      std::vector<my_float> sinvl(N, 0.0);
      std::vector<my_int> s(N, 3), l(N, 2);
      std::vector<Packed<my_float, my_int>> p1(
          N, Packed<my_float, my_int>(0.0, 3, 2));
      // benchmark unpacked
      auto tstart = high_resolution_clock::now();
      std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                     std::divides<my_float>{}); // 3 different memory accesses
      auto tend = high_resolution_clock::now();
      t1 += duration_cast<microseconds>(tend - tstart).count();
      if (m == M - 1)
        std::cout << "sinvl[0]: " << sinvl[0] << '\n';
      // benchmark packed
      tstart = high_resolution_clock::now();
      for (auto &elem : p1) // 1 memory access
        elem.comp();
      tend = high_resolution_clock::now();
      t2 += duration_cast<microseconds>(tend - tstart).count();
      if (m == M - 1)
        std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
    }
    std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
    std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
  }
  return 0;
}
The code compiled with g++ -O3 yields, on my machine:
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.
Basically, std::transform beats the packed access by about 2.5x. I would appreciate help understanding this behaviour. Is the result due to a flaw in my benchmarking, or is my reasoning about memory accesses simply wrong? Finally, is there a way to beat std::transform in this example, or is it simply good enough to be a go-to solution? I am an expert neither in compiler optimizations nor in data-oriented design, and thus I could not answer this question myself.
Thanks!
EDIT. I have changed the way I test both methods, as per @bolov's suggestion in the comments.
Now the code looks like:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional> // for std::divides
#include <iostream>
#include <string> // for std::stoul
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  uint32_t N{1000};
  double t{0};
  if (argc == 2)
    N = std::stoul(argv[1]);
#ifndef _M_PACKED
  std::vector<my_float> sinvl(N, 0.0);
  std::vector<my_int> s(N, 3), l(N, 2);
  // benchmark unpacked
  auto tstart = high_resolution_clock::now();
  std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                 std::divides<my_float>{}); // 3 different memory accesses
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();
  std::cout << "sinvl[0]: " << sinvl[0] << '\n';
  std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
  std::vector<Packed<my_float, my_int>> p1(
      N, Packed<my_float, my_int>(0.0, 3, 2));
  // benchmark packed
  auto tstart = high_resolution_clock::now();
  for (auto &elem : p1) // 1 memory access
    elem.comp();
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();
  std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
  std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif
  return 0;
}
with the corresponding shell (fish) script:
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
    echo "Testing unpacked for N = $N"
    ./transform_unpacked.out $N
    ./transform_unpacked.out $N
    ./transform_unpacked.out $N
    echo "Testing packed for N = $N"
    ./transform_packed.out $N
    ./transform_packed.out $N
    ./transform_packed.out $N
end
which gives the following:
Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.
I hope I have understood the proper testing methodology correctly. Still, the difference is a factor of 2-3.
Here's the compiled loop of the std::transform
case:
400fd0: f3 41 0f 7e 04 47 movq xmm0,QWORD PTR [r15+rax*2]
400fd6: 66 0f 61 c0 punpcklwd xmm0,xmm0
400fda: 66 0f 72 e0 10 psrad xmm0,0x10
400fdf: 0f 5b c0 cvtdq2ps xmm0,xmm0
400fe2: f3 0f 7e 0c 43 movq xmm1,QWORD PTR [rbx+rax*2]
400fe7: 66 0f 61 c9 punpcklwd xmm1,xmm1
400feb: 66 0f 72 e1 10 psrad xmm1,0x10
400ff0: 0f 5b c9 cvtdq2ps xmm1,xmm1
400ff3: 0f 5e c1 divps xmm0,xmm1
400ff6: 41 0f 11 04 80 movups XMMWORD PTR [r8+rax*4],xmm0
400ffb: f3 41 0f 7e 44 47 08 movq xmm0,QWORD PTR [r15+rax*2+0x8]
401002: 66 0f 61 c0 punpcklwd xmm0,xmm0
401006: 66 0f 72 e0 10 psrad xmm0,0x10
40100b: 0f 5b c0 cvtdq2ps xmm0,xmm0
40100e: f3 0f 7e 4c 43 08 movq xmm1,QWORD PTR [rbx+rax*2+0x8]
401014: 66 0f 61 c9 punpcklwd xmm1,xmm1
401018: 66 0f 72 e1 10 psrad xmm1,0x10
40101d: 0f 5b c9 cvtdq2ps xmm1,xmm1
401020: 0f 5e c1 divps xmm0,xmm1
401023: 41 0f 11 44 80 10 movups XMMWORD PTR [r8+rax*4+0x10],xmm0
401029: 48 83 c0 08 add rax,0x8
40102d: 48 83 c1 02 add rcx,0x2
401031: 75 9d jne 400fd0 <main+0x570>
In each loop cycle, it processes 8 elements (there are two divps instructions, each doing 4 divisions).
Here's the other case:
401190: f3 0f 6f 42 04 movdqu xmm0,XMMWORD PTR [rdx+0x4]
401195: f3 0f 6f 4a 14 movdqu xmm1,XMMWORD PTR [rdx+0x14]
40119a: 66 0f 70 c9 e8 pshufd xmm1,xmm1,0xe8
40119f: 66 0f 70 c0 e8 pshufd xmm0,xmm0,0xe8
4011a4: f2 0f 70 d0 e8 pshuflw xmm2,xmm0,0xe8
4011a9: 66 0f 6c c1 punpcklqdq xmm0,xmm1
4011ad: 66 0f 72 e0 10 psrad xmm0,0x10
4011b2: 0f 5b c0 cvtdq2ps xmm0,xmm0
4011b5: f2 0f 70 c9 e8 pshuflw xmm1,xmm1,0xe8
4011ba: 66 0f 62 d1 punpckldq xmm2,xmm1
4011be: 66 0f 61 ca punpcklwd xmm1,xmm2
4011c2: 66 0f 72 e1 10 psrad xmm1,0x10
4011c7: 0f 5b c9 cvtdq2ps xmm1,xmm1
4011ca: 0f 5e c1 divps xmm0,xmm1
4011cd: f3 0f 11 02 movss DWORD PTR [rdx],xmm0
4011d1: 0f 28 c8 movaps xmm1,xmm0
4011d4: 0f c6 c9 e5 shufps xmm1,xmm1,0xe5
4011d8: f3 0f 11 4a 08 movss DWORD PTR [rdx+0x8],xmm1
4011dd: 0f 28 c8 movaps xmm1,xmm0
4011e0: 0f 12 c9 movhlps xmm1,xmm1
4011e3: f3 0f 11 4a 10 movss DWORD PTR [rdx+0x10],xmm1
4011e8: 0f c6 c0 e7 shufps xmm0,xmm0,0xe7
4011ec: f3 0f 11 42 18 movss DWORD PTR [rdx+0x18],xmm0
4011f1: 48 83 c2 20 add rdx,0x20
4011f5: 48 83 c1 fc add rcx,0xfffffffffffffffc
4011f9: 75 95 jne 401190 <main+0x730>
In each loop cycle, it processes 4 elements (there is one divps
instruction).
In the first case, the data is in a SIMD-friendly format: the SIMD instructions can operate on it almost without any data movement, and the results can be written out easily (4 results with a single instruction).
In the second case, however, this does not hold. The compiler had to emit a lot of data-moving (shuffle) operations, and each result is written with a separate instruction. The input/output is simply not in a SIMD-friendly format.
I don't have the time to analyse this issue further, but consider this: both snippets are of similar size and use similar instructions, yet the first processes twice as many elements as the second. That alone gives you an idea of why the second is slower. Sorry about the sloppy explanation.