I am trying to get a good grip on data-oriented design and how best to program with the cache in mind. There are basically two scenarios, and I cannot quite decide which is better and why: is it better to have a vector of objects, or several vectors with the objects' atomic data?
A) Vector of objects example
struct A
{
    GLsizei mIndices;
    GLuint mVBO;
    GLuint mIndexBuffer;
    GLuint mVAO;
    size_t vertexDataSize;
    size_t normalDataSize;
};
std::vector<A> gMeshes;
for (const A& mesh : gMeshes)
{
    glBindVertexArray(mesh.mVAO);
    glDrawElements(GL_TRIANGLES, mesh.mIndices, GL_UNSIGNED_INT, 0);
    glBindVertexArray(0);
    // ...
}
B) Vectors with the atomic data
std::vector<GLsizei> gIndices;
std::vector<GLuint> gVBOs;
std::vector<GLuint> gIndexBuffers;
std::vector<GLuint> gVAOs;
std::vector<size_t> gVertexDataSizes;
std::vector<size_t> gNormalDataSizes;
size_t numMeshes = ...;
for (size_t index = 0; index < numMeshes; ++index)
{
    glBindVertexArray(gVAOs[index]);
    glDrawElements(GL_TRIANGLES, gIndices[index], GL_UNSIGNED_INT, 0);
    glBindVertexArray(0);
    // ...
}
Which one is more memory-efficient and cache-friendly, resulting in fewer cache misses and better performance, and why?
With some variation according to which level of cache you're talking about, cache works as follows:
- if the data is already in cache then it is fast to access
- if the data is not in cache then you incur a cost, but an entire cache line (or page, if we're talking RAM vs swap file rather than cache vs RAM) is brought into cache, so accesses close to the missed address will not miss.
- if you're lucky then the memory subsystem will detect sequential access and pre-fetch data that it thinks you're about to need (the sketch after this list makes these last two effects visible).
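To make that concrete, here is a minimal stand-alone sketch (mine, not from your code) assuming a 64-byte cache line, 4-byte ints, and an array much larger than any cache. Summing with stride 1 touches each line once and lets the prefetcher work; a stride of 16 ints pays roughly one miss per access even though the arithmetic is identical:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = std::size_t(1) << 26;  // 64M ints = 256 MiB, far bigger than cache
    std::vector<int> data(n, 1);

    auto time = [&](std::size_t stride) {
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        // Visit every element exactly once regardless of stride, so the
        // arithmetic work is identical; only the access pattern changes.
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < n; i += stride)
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %2zu: %lld ms (sum=%lld)\n", stride,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            sum);
    };

    time(1);   // sequential: prefetcher-friendly, roughly one miss per 16 ints
    time(16);  // 16 ints = 64 bytes: every access lands on a different line, and
               // the array is too big for revisited lines to still be cached
}

The exact numbers will vary with your hardware, but the gap between the two runs is the cost this answer is talking about.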
So naively the questions to ask are:
- how many cache misses occur? -- B wins, because in A you fetch some unused data per record, whereas in B you fetch none other than a small rounding error at the end of the iteration. So in order to visit all of the necessary data, B fetches fewer cache lines, assuming a significant number of records (the back-of-envelope arithmetic after this list puts numbers on the difference). If the number of records is insignificant, then cache performance may have little or nothing to do with the performance of your code, because a program that uses a small enough amount of data will find that it's all in cache all the time.
- is the access sequential? -- yes in both cases, although this might be harder for the prefetcher to detect in case B because there are two interleaved sequences (gVAOs and gIndices) rather than just one.
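To put rough numbers on it (assuming 4-byte GL handles and an 8-byte size_t, which is typical but worth checking on your platform): the draw loop reads only mVAO and mIndices, i.e. 8 of the 32 bytes of each A, so in layout A a 64-byte line yields 2 usable records, while in layout B each hot vector packs 16 records per line:

#include <cstddef>
#include <cstdio>

using GLsizei = int;           // 32-bit per the GL spec
using GLuint  = unsigned int;  // 32-bit per the GL spec

struct A
{
    GLsizei mIndices;       // 4 bytes -- read by the draw loop
    GLuint  mVBO;           // 4 bytes -- unused by the loop
    GLuint  mIndexBuffer;   // 4 bytes -- unused by the loop
    GLuint  mVAO;           // 4 bytes -- read by the draw loop
    size_t  vertexDataSize; // 8 bytes -- unused by the loop
    size_t  normalDataSize; // 8 bytes -- unused by the loop
};

int main()
{
    // Layout A: each record occupies sizeof(A) bytes, but the loop
    // reads only 8 of them (mIndices + mVAO).
    std::printf("sizeof(A) = %zu\n", sizeof(A));                       // 32 on most ABIs
    std::printf("records per 64-byte line (A): %zu\n", 64 / sizeof(A)); // 2
    // Layout B: the hot vectors are dense, so one line holds 16
    // handles -- 8x fewer line fills for the same number of records.
    std::printf("records per line (B, GLuint): %zu\n", 64 / sizeof(GLuint)); // 16
}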
So, I would sort of expect B to be faster for this code. However:
- if this is the only access to the data, then you could speed up A by removing most of the data members from the struct. So do that (a sketch of such a hot/cold split follows after this list). Presumably in fact it is not the only access to the data in your program, and the other accesses might affect performance in two ways: the time they actually take, and whether they populate the cache with the data you need.
- what I expect and what actually happens are frequently different things, and there is little point relying on speculation if you have any ability to test it. In the best case, the sequential access means that there are no cache misses in either version. Testing performance requires no special tools (although they can make it easier), just a clock with a second hand. At a pinch, fashion a pendulum from your phone charger. (A bare-bones timing harness also follows after this list.)
- there are some complications I have ignored. Depending on hardware, if you're unlucky with B then at the lowest cache level you could find that the accesses to one vector are evicting the accesses to the other vector, because the corresponding memory just happens to use the same location in cache. This would cause two cache misses per record. This will only happen on what's called "direct-mapped cache". "Two-way cache" or better would save the day, by allowing chunks of both vectors to co-exist even if their first preference location in cache is the same. I don't think that PC hardware generally uses direct-mapped cache, but I don't know for sure and I don't know much about GPUs.
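Here is what the hot/cold split suggested above might look like. This is a sketch with names I made up (MeshDrawData, MeshColdData, drawAll), and it assumes a GL loader header and an active context:

#include <cstddef>
#include <vector>
// GL types and functions come from your loader header, e.g.:
// #include <glad/glad.h>

struct MeshDrawData   // hot: the 8 bytes the draw loop actually reads,
{                     // so a 64-byte line now holds 8 records instead of 2
    GLuint  mVAO;
    GLsizei mIndices;
};

struct MeshColdData   // cold: only touched at load/teardown time
{
    GLuint mVBO;
    GLuint mIndexBuffer;
    size_t vertexDataSize;
    size_t normalDataSize;
};

std::vector<MeshDrawData> gDrawData;
std::vector<MeshColdData> gColdData;  // gColdData[i] pairs with gDrawData[i]

void drawAll()
{
    for (const MeshDrawData& mesh : gDrawData)
    {
        glBindVertexArray(mesh.mVAO);
        glDrawElements(GL_TRIANGLES, mesh.mIndices, GL_UNSIGNED_INT, 0);
    }
    glBindVertexArray(0);  // binding the next VAO overwrites the previous
                           // one anyway, so one unbind at the end suffices
}

This keeps the convenience of one struct per mesh for the hot fields while getting layout B's density where it matters.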
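And a bare-bones timing harness, assuming nothing beyond the standard library. drawAllA and drawAllB are hypothetical stand-ins for whatever two versions you want to compare; for real insight, a profiler with cache-miss counters (e.g. perf on Linux) will tell you more than wall-clock time alone:

#include <chrono>
#include <cstdio>

// Run a callable many times and return the total elapsed milliseconds.
template <typename F>
long long timeMs(F&& work, int repetitions)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        work();
    auto t1 = std::chrono::steady_clock::now();
    return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

// Hypothetical usage, with your two layouts wrapped in functions:
// std::printf("A: %lld ms\n", timeMs([&]{ drawAllA(); }, 1000));
// std::printf("B: %lld ms\n", timeMs([&]{ drawAllB(); }, 1000));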