php linux concurrency atomic memcpy

Is an 8-byte memcpy() atomic on a 64-bit Linux machine?


I'm using an 8 byte shared memory segment in PHP, using shmop_* functions. By looking at PHP's source code, I see that internally shmop_write() and shmop_read() use memcpy() to fill/read those 8 bytes.

I wonder whether, on a 64-bit Linux machine, memcpy() is smart enough to copy all 8 bytes in one go (a single instruction), which would make the read and write operations effectively atomic.

I'm also not sure whether these shared memory segments are always 64-bit aligned.

As an example:

$shmop = shmop_open(ftok(__FILE__, 'R'), 'c', 0644, 8);

shmop_write($shmop, "abcdefgh", 0); // <= is this operation atomic?
$a = shmop_read($shmop, 0, 8); // <= is this operation atomic?


Solution

  • If the memcpy you're talking about is the C function, glibc's x86-64 assembly memcpy for size=8 branches to the 8-15 case, which does two fully-overlapping 8-byte copies (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#308).

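    For sizes from 9 to 15, the two copies would overlap in the middle by fewer bytes. So at size=8, shmop_write stores the same 8 bytes twice, and a store from another process could land between the two: the write would effectively happen both before and after that other store. For shmop_read, only the load whose value is stored last ends up in the result, since the other one is overwritten, so it's effectively like C++ std::atomic .load(acquire) on x86-64.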

    This is an implementation detail that's not guaranteed by anything. The choice makes sense, though: if they had instead used a size=5-8 case built from two 4-byte copies (partially overlapping, or for size=8 not overlapping at all), a later 8-byte reload of the result would have a store-forwarding stall, since it would span two separate stores.
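
    The two-overlapping-copies strategy described above can be sketched in C. This is only an illustration, not glibc's actual code (which is hand-written assembly); copy_8_to_15 is a hypothetical name, and the sketch assumes the compiler turns each fixed-size 8-byte memcpy into a single load or store, which is typical for optimized x86-64 builds but not guaranteed.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: copy n bytes, 8 <= n <= 15, using two 8-byte
       loads followed by two 8-byte stores. */
    static void copy_8_to_15(void *dst, const void *src, size_t n)
    {
        assert(n >= 8 && n <= 15);

        const unsigned char *s = src;
        unsigned char *d = dst;
        uint64_t first, last;

        /* Load both chunks before storing anything. */
        memcpy(&first, s, 8);          /* bytes [0, 8)     */
        memcpy(&last, s + n - 8, 8);   /* bytes [n - 8, n) */

        /* For n == 8 both stores cover exactly the same 8 bytes. */
        memcpy(d, &first, 8);
        memcpy(d + n - 8, &last, 8);
    }
    ```

    With n == 8, the destination is written twice with 8-byte stores, which is why another process's store can end up sandwiched between the two.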


    I don't know what musl does, or what other C libraries or other architectures do.

    Other architectures with efficient unaligned loads / stores (like AArch64) might do something similar for sizes less than 2x an integer register. But this isn't guaranteed. If it does use this strategy, it would be like .load(relaxed) since AArch64 is weakly-ordered.

    This of course assumes the shared memory is sufficiently aligned for plain loads/stores to be atomic on your target machine. (The local variable you're copying from/to doesn't matter since it's not shared. So as Barmar pointed out in comments, "abcdefgh" might not be aligned, but that doesn't matter because you don't need atomic access to it, only for storing those 8 bytes from a register once they make it there.)

    On Intel CPUs, that means the 8 bytes of shared memory must not span a cache-line boundary, but can be misaligned anywhere within a 64-byte block. On AMD, it must be 8-byte aligned. (On more recent AMD CPUs, an access that is misaligned within a 16-byte or 32-byte block, or maybe even a 64-byte block, will still give atomic loads/stores.) See "Why is integer assignment on a naturally aligned variable atomic on x86?"
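
    The two alignment conditions above can be expressed as simple address checks. This is a minimal sketch: the function names are hypothetical, and the 64-byte cache-line size is an assumption that holds for current x86-64 CPUs but should be verified for other targets. On Linux, shmat() attaches segments at page-aligned addresses, so offset 0 of a segment should pass both checks.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64  /* assumption: 64-byte cache lines */

    /* True if the size-byte object at address addr lies entirely within
       one cache line (the Intel condition described above). */
    static bool within_one_cache_line(uintptr_t addr, size_t size)
    {
        assert(size > 0 && size <= CACHE_LINE);
        return addr % CACHE_LINE + size <= CACHE_LINE;
    }

    /* True if addr is naturally aligned for an 8-byte access
       (the conservative condition that also covers AMD). */
    static bool is_8_byte_aligned(uintptr_t addr)
    {
        return addr % 8 == 0;
    }
    ```

    To check a real pointer p, pass (uintptr_t)p. An address that is 8-byte aligned always passes the cache-line check for an 8-byte object, so natural alignment is the stricter and safer requirement.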

    On other architectures, you often need natural alignment for atomicity guarantees, so that's your best bet.
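
    For comparison, this is roughly what a guaranteed-atomic 8-byte store and load look like when written directly in C, rather than relying on how memcpy happens to be implemented. It is a sketch under stated assumptions: store_u64 and load_u64 are hypothetical helpers, the __atomic builtins are GCC/Clang extensions, and the target is assumed to support lock-free 8-byte atomics at naturally aligned addresses. PHP's shmop_* functions don't expose anything like this, so it only illustrates the semantics discussed above.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical helper: atomically store 8 bytes with release ordering.
       The address must be naturally (8-byte) aligned. */
    static void store_u64(uint64_t *shared, uint64_t value)
    {
        assert((uintptr_t)shared % 8 == 0);
        __atomic_store_n(shared, value, __ATOMIC_RELEASE);
    }

    /* Hypothetical helper: atomically load 8 bytes with acquire ordering. */
    static uint64_t load_u64(uint64_t *shared)
    {
        assert((uintptr_t)shared % 8 == 0);
        return __atomic_load_n(shared, __ATOMIC_ACQUIRE);
    }
    ```

    On x86-64 both helpers typically compile to a single plain mov. The difference from memcpy is that here the compiler is required to keep the access as one indivisible 8-byte operation.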