ccpu-architectureatomichardware

Can atomic operations of different sizes be mixed?


For the same memory address, if I use atomic operations of different widths to operate on it (assuming the memory is aligned), for example(Assuming the hardware supports 128 bit atomic operations):

#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

#include <pthread.h>

struct vec {
    uint64_t low;
    uint64_t high;
};
union test {
    _Atomic(uint64_t) vec[2];
    _Atomic struct vec atomic;
};

static int val = 0;
static union test vec0 = { .vec = {2, 3} };
static union test vec1 = { .vec = {2, 3} };
static union test vec2 = { .vec = {2, 3} };

static void *thread0_0(void *arg)
{
    struct vec tmp = atomic_load_explicit(&vec0.atomic, memory_order_relaxed);

    /* Is assert always true? */
    assert(tmp.low == 2 || tmp.low == 5);
    assert(tmp.high == 3 || tmp.high == 7);
    return NULL;
}
static void *thread0_1(void *arg)
{
    atomic_store_explicit(&vec0.vec[0], 5, memory_order_relaxed);
    return NULL;
}
static void *thread0_2(void *arg)
{
    atomic_store_explicit(&vec0.vec[1], 7, memory_order_relaxed);
    return NULL;
}

static void *thread1_0(void *arg)
{
    struct vec tmp = {5, 7};
    atomic_store_explicit(&vec1.atomic, tmp, memory_order_relaxed);
    return NULL;
}
static void *thread1_1(void *arg)
{
    uint64_t tmp = atomic_load_explicit(&vec1.vec[0], memory_order_relaxed);
    /* Will assert always be true? */
    assert(tmp == 2 || tmp == 5);
    return NULL;
}
static void *thread1_2(void *arg)
{
    uint64_t tmp = atomic_load_explicit(&vec1.vec[1], memory_order_relaxed);
    /* Will assert always be true? */
    assert(tmp == 3 || tmp == 7);
    return NULL;
}

static void *thread2_0(void *arg)
{
    val = 10;
    struct vec tmp = {5, 7};
    atomic_store_explicit(&vec2.atomic, tmp, memory_order_release);
    return NULL;
}
static void *thread2_1(void *arg)
{
    uint64_t tmp = atomic_load_explicit(&vec2.vec[1], memory_order_acquire);
    if (tmp == 7)
        assert(val == 10); /* Will assert always be true? */
    return NULL;
}

int main(void)
{
    pthread_t th[8];

    assert(pthread_create(&th[0], NULL, thread0_0, NULL) == 0);
    assert(pthread_create(&th[1], NULL, thread0_1, NULL) == 0);
    assert(pthread_create(&th[2], NULL, thread0_2, NULL) == 0);

    assert(pthread_create(&th[3], NULL, thread1_0, NULL) == 0);
    assert(pthread_create(&th[4], NULL, thread1_1, NULL) == 0);
    assert(pthread_create(&th[5], NULL, thread1_2, NULL) == 0);

    assert(pthread_create(&th[6], NULL, thread2_0, NULL) == 0);
    assert(pthread_create(&th[7], NULL, thread2_1, NULL) == 0);

    for (size_t i = 0; i < 8; ++i)
        assert(pthread_join(th[i], NULL) == 0);
    return 0;
}

I would like to know:

  1. For the C standard specification, is this an undefined behavior?
  2. Actually, is there any hardware architecture that can cause it to fail?

The background of this question is that I am recently writing lock-free stack and lock-free queue: https://github.com/atomic-kernel/c-cpp-studying/blob/adcc99275430c67555ee7dc31a8bb7c257017810/lockfree/lstack.h

I found that for the lock-free stack, it can omit updating the count(to fit with ABA problem) when pushing, this can achieve some performance improvement, but I'm not sure if this is dangerous.

Addendum:

If mixing Overlapping Atoms of Different Widths will causes problems, on what hardware arch, executing what atomic instruction will cause the problem?

I have read the implementation of the open-source lock-free library liblfds, which also includes many mixed width read and write operations. For example, from src\lfds711_queue_unbounded_manyproducer_manyconsumer\lfds711_queue_unbounded_manyproducer_manyconsumer_enqueue.c, line 61:

    /* read with 64bit */
    if( qumms->enqueue[COUNTER] == enqueue[COUNTER] and qumms->enqueue[POINTER] == enqueue[POINTER] )
    {
      if( next[POINTER] == NULL )
      {
        new_enqueue[COUNTER] = next[COUNTER] + 1;
        LFDS711_PAL_ATOMIC_DWCAS( enqueue[POINTER]->next, next, new_enqueue, LFDS711_MISC_CAS_STRENGTH_WEAK, result );
        if( result == 1 )
          finished_flag = LFDS711_MISC_FLAG_RAISED;
      }
      else
      {
        next[COUNTER] = enqueue[COUNTER] + 1;
        // TRD : strictly, this is a weak CAS, but we do an extra iteration of the main loop on a fake failure, so we set it to be strong
        /* cas with 128 bit */
        LFDS711_PAL_ATOMIC_DWCAS( qumms->enqueue, enqueue, next, LFDS711_MISC_CAS_STRENGTH_STRONG, result );
      }
    }

Solution

  • The short answer is: maybe it does on all architectures available now (see below), but if you need stricter semantic than the language gives you, go ahead and implement your own with assembly or low level cpu specific intrinsics that are guaranteed to have stricter semantic.


    You are able to convert structs to variables through a union, at least for non-atomic types, without causing undefined behavior. For atomic types however the result is unspecified, because _Atomic(uint64_t) does not necessarily have the same memory representation (and alignment requirements) as uint64_t (even in the case where they are of the same size).

    In practice if the atomics are lock free and the size is the same there is no reason to expect them to be different.

    Assuming that your compiler does indeed use the same representation for both types, (and same thing holds for struct {uint64_t, uint64_t}, which requires intrinsic 128 bit atomic operations at least), and that struct { uint64_t, uint64_t} has no padding between the values then yes your tests 0 and 1 will always work.

    As for test number 2 I wouldn't trust it. The way to implement atomic semantic may be significantly different for different sizes (64 bit and 128 bit), and I don't believe the standard explicitly talk about synchronization between data of different sizes. Note that in x86 all read and writes that fit within a cache line have acquire and release semantic, and thus I expect that to work. Similarly for architectures with explicit memory barrier instructions that do not depend on the read size, such as ARM.

    That said if you need to rely on a specific behavior of a specific set of architectures you shouldn't write undefined behavior C code that could change with new but rather write in assembly directly and refer to the processor manual. C atomics are a high level feature whose implementation you are not allowed to assume if you want your code to be portable not only between architectures but also between different compiler versions.

    In your case though you don't really need to have atomic unions and mix read and writes of different sizes, you can just read the whole structure as a 128 bit value and use a compare exchange loop to modify it. Or you relinquish having the length perfectly synchronized with the pointer (may be acceptable, depending on the use you make of it) and do two independent atomic RMW operations on 64 bit memory.