multithreading, locking, x86-64, data-race, windows64

Is the LOCK prefix mandatory for modifying byte-sized variables in a thread-safe manner?


For the sake of simplicity, let’s assume we have exactly 8 threads and a byte array exactly 8 bytes long. Each thread is assigned one byte from this array – that is, the thread is free to modify its assigned byte and none of the other bytes in the array.

Let’s also assume that the array is aligned on an 8-byte boundary.
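To make this concrete, here is a minimal C++ sketch of the setup I have in mind (the names worker and kThreadCount are just placeholders):

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t kThreadCount = 8;

// One byte per thread; alignas(8) puts the whole array in a single
// 8-byte-aligned block, matching the scenario described above.
alignas(8) std::array<std::uint8_t, kThreadCount> bytes{};

// Each thread writes only bytes[id] and never touches the other elements.
void worker(std::size_t id) {
    for (int i = 0; i < 1000; ++i)
        bytes[id] = static_cast<std::uint8_t>(i);  // plain, unlocked byte store
}

int main() {
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < kThreadCount; ++id)
        threads.emplace_back(worker, id);
    for (auto& t : threads)
        t.join();
    for (auto b : bytes)
        std::printf("%u ", b);  // each element should be 231 (999 % 256)
    std::printf("\n");
}
```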

At first sight it would seem thread-safe to let threads modify their (and only their) bytes ad libitum, as there is actually no shared data here. But – as I understand it – all current Intel and AMD processors running 64-bit Windows read and write no less than 8 bytes (64 bits) at once. So I suppose that when modifying just 1 byte of an aligned 8-byte block, the CPU reads all 8 bytes, modifies the byte in question, and writes back the 1 modified byte together with the 7 unmodified ones. That is anything but thread-safe, so I suspect a LOCK prefix would be necessary when writing these bytes directly.

I really hope I’m wrong, though. Any ideas?


Solution

  • No, lock is not needed in this situation.

    The basic unit of memory in the x86 architecture is the byte. Accesses to distinct bytes are fully independent of one another, as far as software needs to be concerned.

    This is a fundamental enough concept that it's actually hard to find it explicitly stated in official documentation. But for instance, in the Intel Software Developer's Manual, volume 3A, section 8.1.1, there is a statement that reading or writing a byte is "always carried out atomically". This includes loads and stores via plain mov, which can't even take a lock prefix.
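    To illustrate at the code level (a sketch, not from the question): on x86-64, storing to a byte through a plain variable or through a relaxed std::atomic<std::uint8_t> typically compiles to the same unlocked, byte-sized mov:

    ```cpp
    #include <atomic>
    #include <cstdint>

    std::uint8_t plain_byte;
    std::atomic<std::uint8_t> atomic_byte;

    void store_plain(std::uint8_t v) {
        // Compiles to a single byte-sized mov, no lock prefix.
        // (Use the atomic form below if other threads may also
        // access this same byte concurrently.)
        plain_byte = v;
    }

    void store_relaxed(std::uint8_t v) {
        // A relaxed atomic store also compiles to a plain mov on x86-64,
        // because byte stores are already atomic at the architecture level.
        atomic_byte.store(v, std::memory_order_relaxed);
    }
    ```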

    The question is only about x86, but this is true on every modern architecture (DEC Alpha having been the most recent notable exception). It's practically essential in order to support the C11/C++11 memory model, which specifies that concurrent accesses to different objects (including different bytes of the same array) occur independently and do not cause a data race.

    > But – as I understand it – all current Intel and AMD processors running 64-bit Windows read and write no less than 8 bytes (64 bits) at once.

    Architecturally (i.e. at the level that specifies the software-observable behavior of instructions), this is simply not true. From the perspective of software, all x86 machines can read and write single bytes, without interfering in any way with accesses to other bytes from other cores.

    Microarchitecturally (i.e. at the level of hardware implementation details), what cores read and write is actually even larger than 64 bits - it's entire cache lines (typically 64 bytes). However, there are cache coherency protocols (e.g. MESI) which ensure that two cores never make conflicting writes to the same cache line.

    So if core A is trying to write to byte 0 of a given cache line, and core B is simultaneously trying to write to byte 1, one of them will be first to hold the cache line in an exclusive state; let's say it's core A. A will do its write while B waits. Afterward, the modified cache line (including the updated value of byte 0) will be transferred to core B as it acquires ownership of the line. It can then make its write to byte 1. Any core which acquires the cache line later will see the updated versions of both bytes, and there is no possibility for one of them to be permanently lost, as you might otherwise worry about.
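    As a sanity check of that claim, a small stress test along the following lines (a sketch with made-up names) will always end with every byte holding its owner's last value, because the coherency protocol never loses a byte write:

    ```cpp
    #include <array>
    #include <cassert>
    #include <cstdint>
    #include <thread>
    #include <vector>

    int main() {
        // All 8 bytes deliberately share one cache line, forcing the
        // coherency traffic described above.
        alignas(8) std::array<std::uint8_t, 8> bytes{};

        std::vector<std::thread> threads;
        for (std::size_t id = 0; id < bytes.size(); ++id) {
            threads.emplace_back([&bytes, id] {
                // Hammer this thread's own byte while its neighbors do the
                // same to adjacent bytes of the same line.
                for (int i = 0; i < 1'000'000; ++i)
                    bytes[id] = static_cast<std::uint8_t>(id + 1);
            });
        }
        for (auto& t : threads)
            t.join();

        // No write is lost: every byte holds the value its owner stored.
        for (std::size_t id = 0; id < bytes.size(); ++id)
            assert(bytes[id] == id + 1);
    }
    ```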

    This does mean that it is less efficient to have separate cores accessing different bytes within the same cache line, a phenomenon known as false sharing. And so for performance, you might want to arrange it so that data accessed by separate threads is in separate cache lines. But it is not at all an issue for correctness - even in the presence of false sharing, the code operates exactly the same, only more slowly.
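    If you do want to avoid false sharing, a common remedy is to pad each thread's data out to a full cache line, for example (a sketch; std::hardware_destructive_interference_size is C++17, and 64 bytes is a typical line size on current x86-64 parts):

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <new>

    // Pad each thread's byte out to a full cache line so that writes from
    // different threads never contend for the same line.
    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t kCacheLine = 64;  // typical fallback on x86-64
    #endif

    struct alignas(kCacheLine) PaddedByte {
        std::uint8_t value;
    };

    // One cache line per thread instead of one byte per thread.
    PaddedByte per_thread[8];
    ```

    This trades some wasted padding per thread for the elimination of cross-core coherency traffic on the hot data; correctness is identical either way.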