Tags: c++, rust, language-lawyer, atomic

Why does the memory order need to be Acquire in a single consumer linked-list queue when comparing pointer values?


This is a multi-producer, single-consumer linked-list queue, translated from Rust into C++ for the purposes of this language-lawyer question:

template<class T>
struct Node{
    std::atomic<Node*> next;
    T value;
    // initializers listed in declaration order; next starts out null
    Node(T v):next(nullptr),value(std::move(v)){}
};
template<class T>
struct Queue {
    std::atomic<Node<T>*> head;
    Node<T>* tail;
    Queue(){
      auto h = new Node<T>{T{}};
      head.store(h);
      tail = h;
    }
    void push(T t){
       auto node = new Node<T>{t};
       auto pre = this->head.exchange(node,std::memory_order::acq_rel);
       pre->next.store(node,std::memory_order::release);
    }
    T pop(){
       auto tail = this->tail;
       auto next = tail->next.load(std::memory_order::acquire);
       if(next){
          this->tail = next;
          auto ret = next->value;
          delete tail;
          return ret;
       }
       if (this->head.load(std::memory_order::acquire) == tail){  // #1
          throw "empty";
       }else{
         throw "Inconsistent";
       }
    }
};

I wonder why the memory order at #1 needs to be acquire. The load there is used only to compare the pointer value `tail` against the pointer value stored in `this->head`; the comparison never dereferences the loaded pointer. This is unlike the load of `next` above it, where the consumer does dereference the loaded pointer afterward, and therefore needs the producer's initialization of the node to happen before that access. So why can't the memory order at #1 be `memory_order::relaxed`?

Update:

The source code is mpsc_queue.rs from the Rust standard library. Its module documentation explains when the returned value is `Inconsistent`.

PS: the original source is quoted here:

//! A mostly lock-free multi-producer, single consumer queue.
//!
//! This module contains an implementation of a concurrent MPSC queue. This
//! queue can be used to share data between threads, and is also used as the
//! building block of channels in rust.
//!
//! Note that the current implementation of this queue has a caveat of the `pop`
//! method, and see the method for more information about it. Due to this
//! caveat, this queue might not be appropriate for all use-cases.

// https://www.1024cores.net/home/lock-free-algorithms
//                          /queues/non-intrusive-mpsc-node-based-queue

#[cfg(all(test, not(target_os = "emscripten")))]
mod tests;

pub use self::PopResult::*;

use core::cell::UnsafeCell;
use core::ptr;

use crate::boxed::Box;
use crate::sync::atomic::{AtomicPtr, Ordering};

/// A result of the `pop` function.
pub enum PopResult<T> {
    /// Some data has been popped
    Data(T),
    /// The queue is empty
    Empty,
    /// The queue is in an inconsistent state. Popping data should succeed, but
    /// some pushers have yet to make enough progress in order allow a pop to
    /// succeed. It is recommended that a pop() occur "in the near future" in
    /// order to see if the sender has made progress or not
    Inconsistent,
}

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>,
}

/// The multi-producer single-consumer structure. This is not cloneable, but it
/// may be safely shared so long as it is guaranteed that there is only one
/// popper at a time (many pushers are allowed).
pub struct Queue<T> {
    head: AtomicPtr<Node<T>>,
    tail: UnsafeCell<*mut Node<T>>,
}

unsafe impl<T: Send> Send for Queue<T> {}
unsafe impl<T: Send> Sync for Queue<T> {}

impl<T> Node<T> {
    unsafe fn new(v: Option<T>) -> *mut Node<T> {
        Box::into_raw(box Node { next: AtomicPtr::new(ptr::null_mut()), value: v })
    }
}

impl<T> Queue<T> {
    /// Creates a new queue that is safe to share among multiple producers and
    /// one consumer.
    pub fn new() -> Queue<T> {
        let stub = unsafe { Node::new(None) };
        Queue { head: AtomicPtr::new(stub), tail: UnsafeCell::new(stub) }
    }

    /// Pushes a new value onto this queue.
    pub fn push(&self, t: T) {
        unsafe {
            let n = Node::new(Some(t));  // #1
            let prev = self.head.swap(n, Ordering::AcqRel);  // #2
            (*prev).next.store(n, Ordering::Release);  // #3
        }
    }

    /// Pops some data from this queue.
    ///
    /// Note that the current implementation means that this function cannot
    /// return `Option<T>`. It is possible for this queue to be in an
    /// inconsistent state where many pushes have succeeded and completely
    /// finished, but pops cannot return `Some(t)`. This inconsistent state
    /// happens when a pusher is pre-empted at an inopportune moment.
    ///
    /// This inconsistent state means that this queue does indeed have data, but
    /// it does not currently have access to it at this time.
    pub fn pop(&self) -> PopResult<T> {
        unsafe {
            let tail = *self.tail.get();
            let next = (*tail).next.load(Ordering::Acquire);  // #4

            if !next.is_null() { // #5
                *self.tail.get() = next;
                assert!((*tail).value.is_none());
                assert!((*next).value.is_some());
                let ret = (*next).value.take().unwrap();  
                let _: Box<Node<T>> = Box::from_raw(tail);
                return Data(ret);
            }

            if self.head.load(Ordering::Acquire) == tail { Empty } else { Inconsistent }
        }
    }
}

impl<T> Drop for Queue<T> {
    fn drop(&mut self) {
        unsafe {
            let mut cur = *self.tail.get();
            while !cur.is_null() {
                let next = (*cur).next.load(Ordering::Relaxed);
                let _: Box<Node<T>> = Box::from_raw(cur);
                cur = next;
            }
        }
    }
}


Solution

  • I can see one obscure case in which the acquire ordering is relevant.

    A throw of Inconsistent does indicate that at least one producer thread has started pushing to the queue. We could imagine a use case where that fact is used for synchronization. In such a case, the acquire ordering would be necessary.

    Imagine for instance that this structure is being used with just one producer, and they share some block of (non-atomic) scratch memory. The producer will use the scratch memory to produce some results, but by the time it is ready to begin pushing the results into the queue, the scratch memory is no longer needed. The consumer could begin using it at that time.

    A successful pop of a value would indicate that the producer has started pushing and is thus no longer using the scratch memory. However, an inconsistent pop is also evidence of this.

    So in principle, we could have:

    char scratch_memory[LARGE];
    Queue<Data> q;
    
    void producer() {
        read_and_write(scratch_memory);
        q.push(data_value);
    }
    
    void consumer() {
        Data d;
        try {
            d = q.pop();
        }
        catch (const char *s) {
            if (strcmp(s, "Inconsistent") == 0) {
                read_and_write(scratch_memory);
                // ...
            } else {
                // empty queue, wait a while and try again
            }
        }
        // otherwise do something with d
    }
    

    With the acquire ordering, this code is correct; the Inconsistent throw from pop() would occur only upon having loaded a value from head which is not equal to tail, which could only have been stored by push. That store synchronizes with the load, and so the producer's read_and_write(scratch_memory) happens before the corresponding code in the consumer.

    If we did not have the acquire load at #1, this code would have a data race.

    Obviously this use case is rather strange. Normally a throw of Inconsistent should just be followed by a retry (perhaps after a short pause or yield()).

    If pop() were changed to simply retry indefinitely in the inconsistent case, rather than throwing or otherwise returning, then I don't think there would be any need for the acquire ordering in #1. (But I don't quite have a proof of that yet.)
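    For concreteness, a sketch of that retry-forever variant (a hypothetical `pop_blocking`, spinning through the inconsistent window instead of throwing). The acquire load on head is kept here, since relaxing it is, as noted above, only conjectured to be safe:

    ```cpp
    #include <atomic>
    #include <iostream>
    #include <thread>

    template<class T>
    struct Node {
        std::atomic<Node*> next;
        T value;
        Node(T v) : next(nullptr), value(std::move(v)) {}
    };

    template<class T>
    struct Queue {
        std::atomic<Node<T>*> head;
        Node<T>* tail;
        Queue() : head(new Node<T>{T{}}) { tail = head.load(); }
        void push(T t) {
            auto node = new Node<T>{std::move(t)};
            auto pre = head.exchange(node, std::memory_order_acq_rel);
            pre->next.store(node, std::memory_order_release);
        }
        // Spins through the inconsistent state; still throws when truly empty.
        T pop_blocking() {
            for (;;) {
                auto t = tail;
                auto next = t->next.load(std::memory_order_acquire);
                if (next) {
                    tail = next;
                    auto ret = next->value;
                    delete t;
                    return ret;
                }
                if (head.load(std::memory_order_acquire) == t)
                    throw "empty";            // genuinely empty
                std::this_thread::yield();    // a push is mid-flight; wait for its next.store
            }
        }
    };

    int main() {
        Queue<int> q;
        std::thread prod([&q] { for (int i = 1; i <= 100; ++i) q.push(i); });
        long sum = 0;
        for (int n = 0; n < 100; ) {
            try { sum += q.pop_blocking(); ++n; }
            catch (const char*) { std::this_thread::yield(); }  // empty: retry
        }
        prod.join();
        std::cout << sum << "\n";
    }
    ```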