I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.
How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrand or ASAN?
Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?
I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.
This is likely to result in broken code on at least some modern compilers, as the comments note clang 15 will optimize away code which it knows accesses uninitialized values, leaving this approach broken and unsafe.
Should you want to pursue this anyways, it would have the best chance of working safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it very difficult (outside of LTGC) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed. Similarly, you could add compiler barriers such as empty asm
blocks to prevent optimization.
Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.
If you take this approach, tools like valgrind
are fairly likely to report a uninitialized read error.
Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report a uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.
More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.
Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call-site. You could avoid this my making sure the function is not inlined.
What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc
routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.
Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt and copy the assembly. The is_member
function, for example1, looks like:
sparse_array::is_member(unsigned long):
mov rax, QWORD PTR [rdi+16]
mov rdx, QWORD PTR [rax+rsi*8]
xor eax, eax
cmp rdx, QWORD PTR [rdi]
jnb .L1
mov rax, QWORD PTR [rdi+8]
cmp QWORD PTR [rax+rdx*8], rsi
sete al
calloc
magicIf you use calloc
2, you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc
may simply call malloc
and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk
and mmap
, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), to avoid zeroing out the memory again.
As a practical matter, for large allocations, this is typically satisfied by implementing a call like anonymous mmap
by mapping a special zero page of all zeros. When (if ever) the memory is written, does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be for free since the OS already needs to zero the pages.
In that case, implementing your sparse set on top of calloc
could be just as fast as the nominally uninitialized version, while being safe and standards compliant.
You should of course test to ensure that calloc
is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for optimized calloc if something like this:
calloc(N)
if (can satisfy a request for N bytes from allocated-then-freed memory)
memset those bytes to zero and return them
else
ask the OS for memory, return it directly because it is zeroed
Basically, the malloc
infrastructure (which also underlies new
and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocated there first. This pool is composed of memory from the last block request from the OS but not handed out (e.g., because the user requested 32 bytes but the allocated asks for chunks from the OS in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free
or delete
or whatever. The memory in that pool has arbitrary values, and if a calloc
can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.
On the other hand if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set
objects, you will generally just be drawing from the internal malloc
pools and will pay the zeroing costs. If you have a long-lived sparse_set
objects which take up a lot of memory, they likely were allocated by asking the OS and you got the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc
behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero
manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.
For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.
First you choose some granule, at which you will track initialization, and use bitmap where each bit tracks the initialization state of that granule of the sparse
array.
For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t
values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte, which is an overhead of less than 1%3 in allocated memory.
Now you simply check the corresponding bit in this array before accessing sparse
. This adds some small cost to accessing the sparse
array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member
function now looks like:
bool sparse_set::is_member(size_t i){
bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
return init && sparse[i] < n && dense[sparse[i]] == i;
}
The generated assembly on x86 (gcc) now starts with:
mov rax, QWORD PTR [rdi+24]
mov rdx, rsi
shr rdx, 8
mov rdx, QWORD PTR [rax+rdx*8]
xor eax, eax
bt rdx, rsi
jnc .L2
...
.L2:
ret
That's all associate with the bitmap check. It's all going to be pretty quick (and often off the critical path since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set, and what functions you are calling - is_member
is about the worse case for this approach since some functions (e.g., clear
) aren't affected at all, and others (e.g., iterate
) can batch up the is_init
check and only do it once every INIT_COVERAGE
elements (meaning the overhead would be again ~1% for the example values).
Sometimes this approach will be faster than the approach suggested in the OP's link, especially when the handling elements not in the set - in this case, the is_init
check will fail and often shortcut the remaining code, and in this case you have a working set that is much smaller (256 times using the example granule size) than the size of the sparse
array, so you may great reduce misses to DRAM or outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when the an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init
initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.
It is worth noting that there is a similarity between the calloc
and lazy init solutions: both lazily initialize blocks of memory as they are needed, but the calloc
solution track this implicitly in hardware through MMU magic with page tables and TLB entries, while the lazy init solution does it in software, with a bitmap explicitly tracking which granules have been allocated.
The hardware approach has the advantage of being nearly free in (for the "hit" case, anyway) since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size etc.
You can actually combine these approaches, to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense
array at all: just allocate your sparse
array with mmap
as PROT_NONE
, so you fault whenever you read from an un-allocated page in a sparse
array. You catch the fault and allocate the page in the sparse
array with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i
checks any more.
The downsides are:
1 This implementation has no "range checking" - i.e., it doesn't check if i
is greater than MAX_ELEM
- depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM
, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.
2 Really, the only requirement that you use something that calls calloc
under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]()
just do the allocation followed by a memset
. gcc
does optimize malloc
followed by memset
into calloc
, but that's not useful if you are trying to avoid the use of C routines anyway!
3 Precisely, you need 1 extra bit to track every 128 bits of the sparse
array.