c++ multithreading atomic intel memory-alignment

Why does a misaligned access to locked memory read an intermediate value on an Intel CPU?

I'm studying misaligned memory access.

According to the Intel documentation, a LOCK-prefixed instruction uses a bus lock when its memory operand spans two cache lines. (I have a Core i7 CPU.)

9.1.2 Bus Locking

Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. Assertion of this signal is called a bus lock. While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can specify other occasions when the LOCK semantics are to be followed by prepending the LOCK prefix to an instruction.

9.1.2.2 Software Controlled Bus Locking

The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand.

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

I wrote the multi-threaded test code below, and the result is not what I expected.

I expected that using InterlockedCompareExchange (which uses the lock cmpxchg instruction) would lock the bus, so reading the value (using a mov instruction) must also be serialized. But the reader sees intermediate values such as 0xFFFFFFFFAAAAAAAA and 0xAAAAAAAAFFFFFFFF.

Can you explain why? Am I misunderstanding the bus lock, or something else?

[test summary] Read and write a memory location that spans two cache lines.

thread1: compare-and-exchange to NUM1 (0xAAAAAAAAAAAAAAAA)

thread2: compare-and-exchange to NUM2 (0xFFFFFFFFFFFFFFFF)

thread3: read the value and check whether it is neither NUM1 nor NUM2.

result:

#include <iostream>
#include <Windows.h>
#include <thread>

constexpr unsigned __int64 NUM1 = 0xAAAAAAAAAAAAAAAA;
constexpr unsigned __int64 NUM2 = 0xFFFFFFFFFFFFFFFF;

inline bool IsGarbageValue(unsigned __int64 value)
{
  return (value != NUM1 && value != NUM2);
}

#pragma pack(push, 1)
struct alignas(64) TestValue
{
  char pad[60];
  unsigned __int64 value;
};
#pragma pack(pop)

int main()
{
  TestValue test;
  test.value = NUM1;
  std::thread t1([&test]()
    {
      while (1)
      {
        auto org = InterlockedCompareExchange(&test.value, NUM1, NUM2);
      }
    });

  std::thread t2([&test]()
    {
      while (1)
      {
        auto org = InterlockedCompareExchange(&test.value, NUM2, NUM1);
      }
    });

  std::thread t3([&test]()
    {
      while (1)
      {
        if (::IsGarbageValue(test.value))
        {
          std::cout << std::hex << test.value << std::endl;
        }
      }
    });
  t1.join();
  t2.join();
  t3.join();
}

Solution

You wrote: "so reading the value (using mov instruction) must also be serialized". That doesn't follow for a plain load.

A cache-line split load uop needs two separate cache accesses in separate cycles to get both parts of the value.
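To make the split concrete, here is a small sketch that works out which cache lines the question's 8-byte value touches. It assumes 64-byte cache lines; the helper names (line_of, is_split, bytes_in_first_line) are made up for illustration, and std::uint64_t stands in for unsigned __int64 so it also builds outside MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on the Core i7 in the question.
constexpr std::size_t kCacheLine = 64;

// Same layout as the question's TestValue.
#pragma pack(push, 1)
struct alignas(64) TestValue
{
  char pad[60];
  std::uint64_t value;
};
#pragma pack(pop)

// Index of the cache line containing the byte at `offset`
// (offsets are relative to a 64-byte-aligned base address).
constexpr std::size_t line_of(std::size_t offset)
{
  return offset / kCacheLine;
}

// True if the access [offset, offset + size) touches more than one line.
constexpr bool is_split(std::size_t offset, std::size_t size)
{
  return line_of(offset) != line_of(offset + size - 1);
}

// Number of bytes of the access that fall in the first line it touches.
constexpr std::size_t bytes_in_first_line(std::size_t offset, std::size_t size)
{
  return is_split(offset, size) ? kCacheLine - offset % kCacheLine : size;
}

// Hypothetical sanity check: the value starts 4 bytes before the boundary,
// so 4 bytes live in one line and 4 bytes in the next.
inline void check_test_value_layout()
{
  assert(offsetof(TestValue, value) == 60);
  assert(is_split(offsetof(TestValue, value), sizeof(std::uint64_t)));
  assert(bytes_in_first_line(offsetof(TestValue, value), sizeof(std::uint64_t)) == 4);
}
```

With pad[60], the low 4 bytes of value sit at the end of one line and the high 4 bytes at the start of the next, which is why the torn values in the question are mixed at the 32-bit boundary.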

Intel load execution units have some number of split-load buffers to allow multiple split loads to be in-flight at once; a cache line split can only be detected after address-generation, and the CPU is optimized for the non-split case. So the scheduler could have sent it another load uop in the next cycle. Also, one or the other of the cache lines might not be valid in cache, in which case it'll be many cycles before the other half arrives. (And the execution unit can run other load uops in the meantime.)

Anyway, the bus-lock required for a split lock means that neither part of a plain load can happen during a lock cmpxchg, but nothing stops a lock cmpxchg from happening between the pieces of a non-atomic load. The "serialization" you're talking about is on individual accesses to cache, not whole instructions.
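The interleaving described above can be replayed deterministically in a single thread. This is only a toy model, not how the hardware is implemented: SplitValue, locked_cas, split_load and torn_read_demo are hypothetical names, and the order in which the two halves are read (low half first) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t NUM1 = 0xAAAAAAAAAAAAAAAAull;
constexpr std::uint64_t NUM2 = 0xFFFFFFFFFFFFFFFFull;

// Toy model of the little-endian 8-byte value at offset 60:
// the low 4 bytes are in one cache line, the high 4 bytes in the next.
struct SplitValue
{
  std::uint32_t low_half;   // bytes 60..63, first cache line
  std::uint32_t high_half;  // bytes 64..67, second cache line

  std::uint64_t whole() const
  {
    return (static_cast<std::uint64_t>(high_half) << 32) | low_half;
  }
  void store(std::uint64_t v)
  {
    low_half = static_cast<std::uint32_t>(v);
    high_half = static_cast<std::uint32_t>(v >> 32);
  }
};

// Models lock cmpxchg: both halves are compared and updated as one
// indivisible step, and the old value is returned.
inline std::uint64_t locked_cas(SplitValue& m, std::uint64_t desired, std::uint64_t expected)
{
  const std::uint64_t old = m.whole();
  if (old == expected)
    m.store(desired);
  return old;
}

// Models a plain split load: two separate accesses, with `between`
// standing in for whatever another core does in the gap.
template <class Between>
std::uint64_t split_load(const SplitValue& m, Between between)
{
  const std::uint32_t lo = m.low_half;   // first cache access
  between();                             // a whole locked RMW fits here
  const std::uint32_t hi = m.high_half;  // second cache access
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Start at `initial`, and let a CAS to `desired` land between the halves.
inline std::uint64_t torn_read_demo(std::uint64_t initial, std::uint64_t desired)
{
  SplitValue m{};
  m.store(initial);
  return split_load(m, [&] { locked_cas(m, desired, initial); });
}
```

In this model, torn_read_demo(NUM1, NUM2) yields 0xFFFFFFFFAAAAAAAA and torn_read_demo(NUM2, NUM1) yields 0xAAAAAAAAFFFFFFFF, the same two values the question reports. The CAS itself is never torn; only the load that straddles it is.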

There's no way to make a misaligned pure-load atomic on x86 (except with transactional memory (RTM) which is only supported on a few CPUs; Intel keeps disabling it with microcode updates after finding bugs or vulnerabilities in it or caused by it, so OSes may also disable it.) The lock prefix doesn't apply to mov, only to some memory-destination RMW instructions.

To test atomicity of misaligned locked operations, test the return value of InterlockedCompareExchange - that gives you the old value of the location as seen by the CAS attempt, whether it succeeds or fails. i.e. it's the load side of the atomic RMW. (Docs)

Have the two writer threads do ::IsGarbageValue(org) and remove the third thread.
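A rough sketch of that modified test follows. Several things here are assumptions rather than part of the original code: the loop runs for a fixed duration instead of forever, the helper names (cas_old_value, count_torn_cas_results) are invented, non-MSVC compilers use the GCC/Clang builtin __sync_val_compare_and_swap in place of InterlockedCompareExchange64, and non-x86-64 targets fall back to an aligned offset because misaligned atomics may fault there. Accessing a misaligned std::uint64_t this way is not portable C++; it relies on x86 behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#if defined(_MSC_VER)
#include <Windows.h>
#endif

constexpr std::uint64_t NUM1 = 0xAAAAAAAAAAAAAAAAull;
constexpr std::uint64_t NUM2 = 0xFFFFFFFFFFFFFFFFull;

#if defined(__x86_64__) || defined(_M_X64)
constexpr std::size_t kValueOffset = 60;  // spans two 64-byte cache lines
#else
constexpr std::size_t kValueOffset = 64;  // assumption: stay aligned elsewhere
#endif

inline bool IsGarbageValue(std::uint64_t value)
{
  return value != NUM1 && value != NUM2;
}

// Hypothetical wrapper: atomic CAS that returns the old value it observed.
inline std::uint64_t cas_old_value(volatile std::uint64_t* p,
                                   std::uint64_t desired, std::uint64_t expected)
{
#if defined(_MSC_VER)
  return static_cast<std::uint64_t>(InterlockedCompareExchange64(
      reinterpret_cast<volatile LONG64*>(p),
      static_cast<LONG64>(desired), static_cast<LONG64>(expected)));
#else
  return __sync_val_compare_and_swap(p, expected, desired);
#endif
}

// Runs the two writer threads for `duration` and counts how many CAS
// return values were neither NUM1 nor NUM2.
inline std::uint64_t count_torn_cas_results(std::chrono::milliseconds duration)
{
  alignas(64) unsigned char buf[128] = {};
  assert(reinterpret_cast<std::uintptr_t>(buf) % 64 == 0);
  std::memcpy(buf + kValueOffset, &NUM1, sizeof NUM1);
  auto* value = reinterpret_cast<volatile std::uint64_t*>(buf + kValueOffset);

  std::atomic<bool> stop{false};
  std::atomic<std::uint64_t> torn{0};

  auto writer = [&](std::uint64_t desired, std::uint64_t expected)
  {
    while (!stop.load(std::memory_order_relaxed))
    {
      // The return value is the load side of the atomic RMW.
      const std::uint64_t org = cas_old_value(value, desired, expected);
      if (IsGarbageValue(org))
        torn.fetch_add(1, std::memory_order_relaxed);
    }
  };

  std::thread t1(writer, NUM1, NUM2);
  std::thread t2(writer, NUM2, NUM1);
  std::this_thread::sleep_for(duration);
  stop.store(true);
  t1.join();
  t2.join();
  return torn.load();
}
```

Calling count_torn_cas_results(std::chrono::milliseconds(100)) should return 0 on x86 if split locked operations are atomic as the manual describes. Note that some operating systems detect split locks and may throttle or signal a process that performs them, so this is best treated as an experiment, not production code.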