c++ memory simd data-oriented-design

Intuition about memory layout for fast SIMD / data oriented design


I have been watching data-oriented-design talks recently, but I never understood the reasoning behind their unanimously chosen memory layout.

Let's say we have a 3D animation to render, and in each frame we need to re-normalize our orientation vectors.

The "Scalar code"

They always show code that might look something like this:

let scene = [{"camera1", vec4{1, 1, 1, 1}}, ...]

for object in scene
    object.orientation = normalize(object.orientation)

So far so good... The memory at &scene might look roughly thus:

[string,X,Y,Z,W,string,X,Y,Z,W,string,X,Y,Z,W,...]

"SSE aware code"

Every talk then shows the improved, cookie-cutter, version:

let xs = [1, ...]
let ys = [1, ...]
let zs = [1, ...]
let ws = [1, ...]
let scene = [{"camera1", ptr_vec4{&xs[1], &ys[1], &zs[1], &ws[1]}}, ...]

for (o1, o2, o3, o4) in scene
    (o1, o2, o3, o4) = normalize_sse(o1, o2, o3, o4)

Which, due to its memory layout, is not only more memory-efficient, but can also process our scene 4 objects at a time.

Memory at &xs, &ys, &zs, and &ws:

[X,X,X,X,X,X,...]
[Y,Y,Y,Y,Y,Y,...]
[Z,Z,Z,Z,Z,Z,...]
[W,W,W,W,W,W,...]
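
To make the "4 objects at a time" claim concrete, here is a minimal C++ sketch of what such a loop might look like with SSE intrinsics. The `Orientations` struct and the body of `normalize_sse` are assumptions made for illustration, not code from any talk; the sketch also assumes an x86/x86_64 target, an element count that is a multiple of 4, and no zero-length vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86_64 only)

// Hypothetical structure-of-arrays container: one contiguous array per component.
struct Orientations {
    std::vector<float> xs, ys, zs, ws;
};

// Normalizes four 4-component vectors per loop iteration.
// Assumes the element count is a multiple of 4 and no vector has zero length.
inline void normalize_sse(Orientations& o) {
    const std::size_t n = o.xs.size();
    assert(n % 4 == 0 && o.ys.size() == n && o.zs.size() == n && o.ws.size() == n);

    for (std::size_t i = 0; i < n; i += 4) {
        // Each load fetches the same component of four consecutive objects.
        __m128 x = _mm_loadu_ps(&o.xs[i]);
        __m128 y = _mm_loadu_ps(&o.ys[i]);
        __m128 z = _mm_loadu_ps(&o.zs[i]);
        __m128 w = _mm_loadu_ps(&o.ws[i]);

        // Lane k holds the squared length of object i + k.
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                                 _mm_add_ps(_mm_mul_ps(z, z), _mm_mul_ps(w, w)));
        __m128 len = _mm_sqrt_ps(len2);

        // Divide every component by its object's length and write it back.
        _mm_storeu_ps(&o.xs[i], _mm_div_ps(x, len));
        _mm_storeu_ps(&o.ys[i], _mm_div_ps(y, len));
        _mm_storeu_ps(&o.zs[i], _mm_div_ps(z, len));
        _mm_storeu_ps(&o.ws[i], _mm_div_ps(w, len));
    }
}
```

No shuffles are needed here: because each register holds one component of four different objects, the whole computation is plain lane-wise arithmetic.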

But why 4 separate arrays?

If the __m128 (packed-4-singles) is the predominant type in engines,
    which I believe it is;
and if the type is 128 bits long,
    which it definitely is;
and if the cache line width / 128 = 4,
    which it almost always does;
and if x86_64 is only capable of writing a full cache line,
    which I am almost certain of
- why is the data not structured as follows instead?!

Memory at &packed_orientations:

[X,X,X,X,Y,Y,Y,Y,Z,Z,Z,Z,W,W,W,W,X,X,...]
 ^---------cache-line------------^
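
Spelled out as a C++ type, that packed layout might look roughly like the sketch below. The names `OrientationBlock` and `x_of` are made up for illustration, and the 64-byte cache line is an assumption that holds on most current x86_64 parts but is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block: the orientations of four objects, grouped by component.
// alignas(64) assumes a 64-byte cache line.
struct alignas(64) OrientationBlock {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
};

static_assert(sizeof(OrientationBlock) == 64, "one block fills one 64-byte line");
static_assert(alignof(OrientationBlock) == 64, "blocks start on line boundaries");

// Maps object index i to its block (i / 4) and its lane inside the block (i % 4).
inline float& x_of(OrientationBlock* blocks, std::size_t block_count, std::size_t i) {
    assert(i / 4 < block_count);
    return blocks[i / 4].x[i % 4];
}
```

With this type, one pointer plus an index reaches all four components of any object, which is the "pointer addition instead of 4 pointers" idea.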

I have no benchmark to test this on, and I don't understand intrinsics enough to even try, but by my intuition, should this not be way faster? We would be saving 4x page loads and writes, simplifying allocations, and saving pointers, and the code would be simpler since instead of 4 pointers we can do pointer addition. Am I wrong?

Thank you! :)


Solution

  • The amount of data you need to get through your memory subsystem is identical no matter whether you do 4 separate arrays or your suggested interleaving. You therefore don't save page loads or writes (I don't see why the "separate arrays" case should read and write each page or cache line more than once).

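    You do disperse the memory transfers more: assuming 64-byte cache lines, you might have 1 L1 cache miss every iteration in your case (each iteration consumes one full line) and 4 cache misses every 4th iteration in the "separate arrays" case (each array's line holds 16 floats, enough for four iterations). I don't know which one would be preferred.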

    Anyway, the main point is to not have unnecessary memory pushed through your caches that you don't interact with. In your example, having string values that are neither read nor written but still pushed through the caches needlessly takes up bandwidth.
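
As a rough illustration of that last point, the sketch below compares how many bytes stream through the caches per pass. `SceneObject`, `aos_bytes` and `packed_bytes` are hypothetical names introduced here, and `sizeof(std::string)` is implementation-defined, so the exact ratio will vary between standard libraries.

```cpp
#include <cstddef>
#include <string>

// Hypothetical array-of-structures element: the name object sits next to the
// orientation even though the normalization loop never reads or writes it.
struct SceneObject {
    std::string name;
    float orientation[4];
};

// Bytes of orientation data the normalization loop needs per object.
constexpr std::size_t kHotBytes = 4 * sizeof(float);

// Approximate bytes pulled through the caches when walking `count` objects
// stored as a contiguous array of SceneObject.
constexpr std::size_t aos_bytes(std::size_t count) {
    return count * sizeof(SceneObject);
}

// Bytes pulled through when the orientations are stored on their own. The
// total is the same for four separate arrays and for interleaved 4-wide blocks.
constexpr std::size_t packed_bytes(std::size_t count) {
    return count * kHotBytes;
}
```

For example, `packed_bytes(1000)` is 16000 regardless of which of the two packed layouts is used, while `aos_bytes(1000)` is larger by however much space the unused `name` members occupy.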