c++, stdvector, memory-alignment, alignas

Alignment for vector of vectors in C++ templated type


In C++, what is the shortest way to declare a vector of vectors, where each inner std::vector's metadata fields (importantly size and pointer) have a user-chosen alignment?

Ideally I'd be able to specify the alignment directly inside the template, e.g.

vector< alignas(64) vector<int> > my_vector;
// or
vector< alignas(std::hardware_destructive_interference_size) vector<int> > my_vector;

But that doesn't seem to be valid syntax for the placement of the alignas keyword.

If that is not possible, what is the shortest alternative that is still generic (i.e. does not require a non-templated wrapper struct of a concrete type just to hold the alignas)?

I found some hints to approaches in the Reddit thread "Idiomatic way to avoid false sharing" but the boilerplate still looks rather large.

Drop-in API requirement

A simple way might be something like

template <typename T, std::size_t Alignment>
struct AlignedObject {
    alignas(Alignment) T object;
};

but this adds extra wrapping (requiring access through .object).

I'm looking to declare just the alignment, without having to change the interface through which my_vector needs to be accessed, so that the solution is a proper API-compatible drop-in replacement for vector<vector<T>> with the same constructors, methods, behaviour, future-proofness (for when std::vector may get changed in the future) and so on.

Remark about importance in parallel programs

Aligning the vector metadata fields is important to avoid false sharing in parallel programs. It is extremely common to have a parallelised loop where each loop iteration writes its own output vector, thus a vector<vector<T>> is needed.

Each thread writes to its respective vector, so you'd think this should have perfectly-linear speedup; but because the vectors' size fields are near each other (fitting into the same cache line), false sharing triggers, and e.g. a push_back() of one thread slows down that of another thread. This is explained e.g. in the talk "Faster than memcpy".
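To make the layout issue concrete, here is a small sketch (not taken from the talk) that counts how many neighbouring array elements start in the same cache line. The helper names and the 64-byte line size are assumptions for illustration. On common 64-bit standard library implementations sizeof(std::vector<int>) is 24 bytes, so several vector headers end up in one line.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache line size; adjust for the target CPU.
constexpr std::size_t kAssumedCacheLine = 64;

// True if both addresses fall into the same (assumed) cache line.
inline bool same_cache_line(const void* a, const void* b,
                            std::size_t line = kAssumedCacheLine) {
    assert(line != 0);
    return reinterpret_cast<std::uintptr_t>(a) / line ==
           reinterpret_cast<std::uintptr_t>(b) / line;
}

// Hypothetical helper: counts neighbouring elements whose first byte
// lies in the same cache line.
template <class Element, std::size_t N>
std::size_t adjacent_pairs_sharing_line(const std::array<Element, N>& arr) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        if (same_cache_line(&arr[i], &arr[i + 1])) ++count;
    }
    return count;
}
```

For a plain std::array<std::vector<int>, 10> this count is non-zero whenever the vector header is much smaller than the cache line, which is exactly the situation in which two threads updating neighbouring headers can interfere.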

Edit: Alternative solution to this subproblem via std::move()

Commenter Loki Astari points out below that in most cases, this can be solved by std::move(inner_vector), so that each thread has its own stack-allocated metadata fields, on which repeated fast updates can be performed (e.g. changing the size (__end_ pointer) via .push_back()) without triggering false sharing. But this question is still useful for situations where this is not desired.
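A minimal sketch of how I understand that suggestion; fill_slot is a hypothetical name, and each worker thread would call it on its own element of the outer vector. The shared metadata is touched only twice, once when moving out and once when moving back.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical per-thread task: `slot` is this thread's element of the
// outer vector<vector<int>>.
void fill_slot(std::vector<int>& slot, int count) {
    assert(count >= 0);

    // Move the shared element into a local vector. Its pointer and size
    // fields now live on the calling thread's stack.
    std::vector<int> local = std::move(slot);

    // All repeated metadata updates happen on the local object.
    for (int k = 0; k < count; ++k) {
        local.push_back(k);
    }

    // One final write back to the shared element.
    slot = std::move(local);
}
```

Existing contents of the slot are preserved, because the local vector takes over the original buffer before appending to it.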

Remark for comparing solutions

I found that for quick testing e.g. on godbolt.org, using an outer array instead of vector can be useful to quickly check the involved alignments, e.g.:

std::array<      std::vector<int>              , 10> my_vector;
std::array< /* insert solution approach here */, 10> my_vector;
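As a sketch of such a check, the following uses the AlignedObject wrapper from above as the candidate type, only because it is the simplest one at hand. The helper all_elements_aligned is a hypothetical name. It verifies at run time that every element of the array starts on the requested boundary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The wrapper from the question, used here only as a test candidate.
template <typename T, std::size_t Alignment>
struct AlignedObject {
    alignas(Alignment) T object;
};

// Hypothetical helper: true if every element of `arr` starts at an
// address that is a multiple of `alignment`.
template <class Element, std::size_t N>
bool all_elements_aligned(const std::array<Element, N>& arr,
                          std::size_t alignment) {
    assert(alignment != 0);
    for (const auto& element : arr) {
        if (reinterpret_cast<std::uintptr_t>(&element) % alignment != 0) {
            return false;
        }
    }
    return true;
}

// Compile-time checks that can be pasted next to any candidate type.
using Candidate = AlignedObject<std::vector<int>, 64>;
static_assert(alignof(Candidate) == 64, "candidate is not 64-byte aligned");
static_assert(sizeof(Candidate) % 64 == 0, "candidate stride is not a multiple of 64");
```

The sizeof check matters as much as alignof: consecutive elements are sizeof(Candidate) bytes apart, so the stride has to be a multiple of the alignment for every element to start on its own line.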

Thank you!


Solution

  • You could make a new class that you align to your liking. Here I've inherited from std::vector; you could use composition instead if you prefer:

    template <class T, std::size_t Align, class Allocator = std::allocator<T>>
    struct alignas(Align) aligned_vector : std::vector<T, Allocator> {
        // name the actual base (including Allocator) so custom allocators work
        using std::vector<T, Allocator>::vector;
        using std::vector<T, Allocator>::operator=;
    };
    

    Usage:

    std::vector<aligned_vector<int, 64>> v;
    
    // check the alignment of the inner vector
    static_assert(alignof(decltype(v)::value_type) == 64);
    
    // do things in parallel with no false sharing of the inner vectors'
    // metadata (given a cache line of at most 64 bytes);
    // needs <algorithm> and <execution> (C++17):
    std::for_each(std::execution::par, v.begin(), v.end(), [](auto& inner) {
        // work on the inner vector here
    });
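    For quick experiments, here is a self-contained sketch of the same idea that names std::vector<T, Allocator> in the using-declarations, so a custom allocator is passed through to the base. The helper every_inner_header_aligned is a hypothetical name added only for checking; it is not needed for the technique itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

template <class T, std::size_t Align, class Allocator = std::allocator<T>>
struct alignas(Align) aligned_vector : std::vector<T, Allocator> {
    // Inherit the constructors and assignment operators of the actual base.
    using std::vector<T, Allocator>::vector;
    using std::vector<T, Allocator>::operator=;
};

// Hypothetical helper: true if every inner vector object in `outer`
// starts at an address that is a multiple of `align`.
template <class Outer>
bool every_inner_header_aligned(const Outer& outer, std::size_t align) {
    assert(align != 0);
    for (const auto& inner : outer) {
        if (reinterpret_cast<std::uintptr_t>(&inner) % align != 0) {
            return false;
        }
    }
    return true;
}

// The padded size is a multiple of the alignment, so neighbouring
// elements never start in the same 64-byte line.
static_assert(alignof(aligned_vector<int, 64>) == 64, "unexpected alignment");
static_assert(sizeof(aligned_vector<int, 64>) % 64 == 0, "unexpected stride");
```

    Two caveats apply. std::vector has no virtual destructor, so an aligned_vector should not be deleted through a pointer to std::vector. Storing over-aligned elements in the outer std::vector also relies on the aligned allocation support that std::allocator gained in C++17.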