in Mike Acton's talks here and here (links are with timestamps), as well as in this blog post, they prepare/precondition/prefetch/query the needed, non-contiguous data first in order to then iterate over it and do the calculations in a cache-friendly way.
Here is the code from the blog post:
// objects_store.h
struct ObjectsStore
{
    vector<Vector3D>   m_Positions;
    vector<Quaternion> m_Orientations;
};

// Runtime solution:
struct TranslationData
{
    Vector3D         m_ObjPosition;
    const Quaternion m_ObjOrientation;
    const Vector3D   m_ObjTranslation;
};

void PrepareTranslations(const ObjectsStore& inStore,
                         const vector<Vector3D>& inTranslations,
                         vector<TranslationData>& outObjectsToTranslate)
{
    assert(inStore.m_Positions.size() == inStore.m_Orientations.size());
    assert(inStore.m_Positions.size() == inTranslations.size());
    for (size_t i = 0; i < inTranslations.size(); ++i)
    {
        outObjectsToTranslate.push_back
        ({
            inStore.m_Positions[i],
            inStore.m_Orientations[i],
            inTranslations[i]
        });
    }
}

void TranslateObject(vector<TranslationData>& ioObjectsToTranslate)
{
    for (TranslationData& data : ioObjectsToTranslate)
        data.m_ObjPosition += data.m_ObjOrientation * data.m_ObjTranslation;
}

void ApplyTranslation(const vector<TranslationData>& inTranslationData,
                      ObjectsStore& outStore)
{
    assert(outStore.m_Positions.size() == inTranslationData.size());
    for (size_t i = 0; i < inTranslationData.size(); ++i)
        outStore.m_Positions[i] = inTranslationData[i].m_ObjPosition;
}

void UpdateGame(ObjectsStore& ioStore, vector<Vector3D>& inTranslations)
{
    vector<TranslationData> translation_data;
    PrepareTranslations(ioStore, inTranslations, translation_data);
    TranslateObject(translation_data);
    ApplyTranslation(translation_data, ioStore);
}
My question is: in order to prepare the data in such a way, one has to access the data anyway (and also copy it). So even if the data is more cache-friendly afterwards, I wonder whether it would be more efficient to modify the data directly and skip the preparation step.
As far as I understand it, calculations on the prepared data are obviously very efficient, but at the additional cost of preparing it in the first place.
Can anyone tell me what I am missing here? Does one pay the extra cost to gain other benefits (like clearer code), or is it actually more efficient than direct modification?
Thanks in advance
Here it is not worth preparing the data. In fact, it will almost certainly be slower.
First of all, PrepareTranslations calls push_back without a prior call to reserve, so the vector will be reallocated O(log n) times. Not great for memory accesses, especially for large vectors.
The following points need to be considered:
- PrepareTranslations should be memory bound and is not SIMD friendly (push_back almost certainly prevents any SIMD optimization).
- TranslateObject should be rather compute bound (though it might not benefit well from AVX2/AVX-512, which operate on 8 and 16 float items respectively, while the Vector3D should have 3 attributes and the Quaternion 4).
- ApplyTranslation should be memory bound.
The thing is, if you merge the 3 operations into one loop, the CPU can hide the memory latency with computation, read/write data while computing, and move less data through memory overall. Here, preparing the data gives almost no benefit and introduces additional costs, so the result is almost certainly a slower execution.
Preparing data would be a good idea if it then makes the main computation more SIMD friendly, for example. It can also be a good idea for reducing latency stalls in complex loops whose loop-carried dependencies prevent the CPU from fetching a lot of data from memory concurrently. Loop splitting can also help avoid cache thrashing in some cases (e.g. when, in the same loop, you access many arrays aligned on a large power of two in memory).
Here is a practical example: you have a list of N 2D positions (e.g. players) you want to update based on a function requiring 8 arrays of N items, and the computation requires accesses to a large hash map. In this case, it is better to split the work: first gather the hash-map values into a contiguous array in a scalar pass, then do the arithmetic on the plain arrays in a second pass.
Indeed, the hash map will prevent any SIMD optimization by mainstream compilers. The same would be true for code with conditional branches that cannot be vectorized.
In your case, using one array per coordinate will almost certainly speed things up a bit, assuming you do not need to convert the SoA back to AoS every time (consider keeping the data as SoA as much as possible). Indeed, this makes the computing part more SIMD friendly: modern mainstream x86-64 CPUs can operate on 8 float items at a time using a single AVX/AVX2 instruction, and even 16 float items with AVX-512 (so a Vector3D data structure does not fit well). If the arrays contain a lot of items, then AoSoA is the best layout (since AoS is not SIMD friendly and SoA tends to cause cache thrashing on large arrays). However, it is a pain to write (and maintain) code using AoSoA data structures, not to mention that changing the layout is also complicated, and often needed because of external libraries.
Last but not least, when you have to fetch many data structures per item of an array, the accesses look random and mainstream CPUs cannot prefetch the data easily. In this case, they will just start the memory loads as early as possible, but this is often not enough, since the memory latency can be huge (especially for in-RAM data) and there is generally not enough computation to hide it. Manual prefetching can help here, but it is not a silver bullet and it is brittle (dependent on the target architecture). In this case, moving the data into plain arrays (i.e. switching from AoS to SoA) can help a bit, because hardware prefetchers can prefetch data with a small constant stride much more efficiently.
If some of your computations can be ported to GPUs, be aware that SoA is generally far more efficient on them than AoS (because of memory coalescing, and also because GPUs are SIMT devices). The benefit of AoSoA on GPUs is often small, so it is generally not worth the effort on such platforms.