The issue mentioned in the push function of a lock-free queue in "C++ Concurrency in Action" Section 7.15

I read in "C++ Concurrency in Action" Listing 7.15:

Using the reference-counting scheme avoids this particular race, but it’s not the only race in push(). If you look at the revised version of push() in listing 7.15, you’ll see a pattern you saw in the stack: load an atomic pointer (1) and dereference (2) that pointer. In the meantime, another thread could update the pointer (3), eventually leading to the node being deallocated (in pop()). If the node is deallocated before you dereference the pointer, you have undefined behavior. Ouch! It’s tempting to add an external count in tail the same as you did for head, but each node already has an external count in the next pointer of the previous node in the queue. Having two external counts for the same node requires a modification to the reference-counting scheme to avoid deleting the node too early. You can address this by also counting the number of external counters inside the node structure and decreasing this number when each external counter is destroyed (as well as adding the corresponding external count to the internal count). If the internal count is zero and there are no external counters, you know the node can safely be deleted. This is a technique I first encountered through Joe Seigh’s Atomic Ptr Plus Project (http://atomic-ptr-plus.sourceforge.net/). The following listing shows how push() looks under this scheme.

Listing 7.15 A (broken) first attempt at revising push():

void push(T new_value)
{
    std::unique_ptr<T> new_data(new T(new_value));
    counted_node_ptr new_next;
    new_next.ptr = new node;
    new_next.external_count = 1;
    for (;;)
    {
        node* const old_tail = tail.load(); // 1
        T* old_data = nullptr;
        if (old_tail->data.compare_exchange_strong(
                old_data, new_data.get())) // 2
        {
            old_tail->next = new_next;
            tail.store(new_next.ptr); // 3
            new_data.release();
            break;
        }
    }
}

However, I don’t quite understand this part:

If the node is deallocated before you dereference the pointer, you have undefined behavior.

In pop(), the function first checks if head and tail point to the same placeholder node. If they do, it determines the queue is empty and does nothing. In Listing 7.15, the push() function updates tail only at the last step, meaning that pop() wouldn’t operate on the node being enqueued until push() completes the enqueue operation. So, where does this race condition come from?

Since the author did not provide an example of the assumption for this situation in the book (the example is based on the previous lock-free stack), I cannot provide the complete context. Below is the code for pop() before the modification.

template <typename T>
class lock_free_queue {
private:
    struct node {
        std::shared_ptr<T> data;
        node *next;
        node() : next(nullptr) {}
    };

    std::atomic<node *> head;
    std::atomic<node *> tail;

node *pop_head() {
    node *const old_head = head.load();
    if (old_head == tail.load()) {
        return nullptr;
    }
    head.store(old_head->next);
    return old_head;
}

public:
    lock_free_queue() : head(new node), tail(head.load()) {}

    // ...

    std::shared_ptr<T> pop() {
        node *old_head = pop_head();
        if (!old_head) {
            return std::shared_ptr<T>();
        }
        std::shared_ptr<T> const res(old_head->data);
        delete old_head;
        return res;
    }

    void push(T new_value) {
        std::shared_ptr<T> new_data(std::make_shared<T>(new_value));
        node *p = new node;
        node *const old_tail = tail.load();
        old_tail->data.swap(new_data);
        old_tail->next = p;
        tail.store(p);
    }
    
};

Solution

To imagine what can go wrong, think that after each instruction the thread could be suspended for a very long time.

void push(T new_value)
{
  ...
  node* const old_tail = tail.load(); // 1
  T* old_data = nullptr;
  // thread suspended here for 1 second
  // thread wakes up later and old_tail is dangling
  if (old_tail->data.compare_exchange_strong(
      old_data, new_data.get())) // 2
   ...
}

If this thread got suspended before it dereferenced the pointer and you have 2 other threads doing push and pop, the pointer that you have on the suspended thread's stack will be dangling when the thread wakes up.

loading a pointer and trying to modify the data it points to in any way (including incrementing an internal refcount) won't be atomic, and therefore cannot work in a lock-free queue, this is later solved by doing CAS on both the pointer and an external refcount before dereferencing the pointer, which is used in some atomic_shared_ptr implementations. (but not the standard library one)