Hazard pointers are a technique for safely reclaiming memory in lock-free code without garbage-collection.
The idea is that before accessing an object that can be deleted concurrently, a thread sets its hazard pointer to point to that object. A thread that wants to delete an object will first check whether any hazard pointers are set to point to that object. If so, deletion will be postponed, so that the accessing thread does not end up reading deleted data.
Now, imagine our deleting thread starts to iterate the list of hazard pointers and at the i+1
element it gets preempted. Now another thread sets the hazard pointer at i
to the object that the deleting thread is currently trying to delete. Afterwards, the deleting thread resumes, checks the rest of the list, and deletes the object, even though there is now a hazard pointer at position i
pointing to the object.
So clearly just setting the hazard pointer is not enough, as a deleting thread might already have checked our hazard pointer and decided that our thread does not want to access the object. How can I make sure, after setting a hazard pointer, that the object I'm trying to access won't be deleted from under my hands?
The original paper by Maged M. Michael places this important restriction on algorithms using hazard pointers:
The methodology requires lock-free algorithms to guarantee that no thread can access a dynamic node at a time when it is possibly removed from the object, unless at least one of the thread’s associated hazard pointers has been pointing to that node continuously, from a time when the node was guaranteed to be reachable from the object’s roots. The methodology prevents the freeing of any retired node continuously pointed to by one or more hazard pointers of one or more threads from a point prior to its removal.
As pointed out in Anton's answer, deletion is a two-phase operation: First you have to 'unpublish' the node, remove it from the data structure so that it can no longer be accessed from the public interface.
At this point, the node is possibly removed, in Michael's terms. It is no longer safe for concurrent threads to access it (unless they already have been holding a hazard pointer to it throughout).
Thus, once a node is possibly removed, it is safe for the deleting thread to iterate the list of hazard pointers. Even if the deleting thread gets preempted, a concurrent thread may not access the node anymore. After verifying that no hazard pointers are set to the node, the deleting thread can safely proceed to the second phase of deletion: The actual deallocation.
In summary, the order of operations for the deleting thread is
D-1. Remove the node from the data structure.
D-2. Iterate the list of hazard pointers.
D-3. If no hazards were found, delete the node.
The real algorithm is slightly more involved, as we need to maintain a list of those nodes that cannot be reclaimed and ensure that they get deleted eventually. This has been skipped here, as it is not relevant to explain the issue raised in the question.
Setting the hazard pointer is not enough to guarantee safe access to it. After all, the node might be possibly removed by the time we set our hazard pointer.
The only way to ensure safe access is if we can guarantee that our hazard pointer has been pointing to that node continuously, from a time when the node was guaranteed to be reachable from the object’s roots.
Since the code is supposed to be lock-free, there is only one way to achieve this: We optimistically set our hazard pointer to the node and then check whether that node has been marked as possibly deleted (that is, it is no longer reachable from the public root) afterwards.
Thus the order of operations for the accessing thread is
A-1. Obtain a pointer to the node by traversing the data structure.
A-2. Set the hazard pointer to point to the node.
A-3. Check that the node is still part of the data structure.
That is, it has not been possibly removed in the meantime.
A-4. If the node is still valid, access it.
After a node has been possibly removed (D-1
), the deleting thread could be preempted. Thus it is still possible for concurrent threads to optimistically set their hazard pointer to it (even though they are not allowed to access it) (A-2
).
Therefore, the deleting thread might detect a spurious hazard, preventing it from deleting the node right away, even though none of the other threads will access the node anymore. This will simply delay deletion of the node in the same way a legitimate hazard would.
The important point is that the node will still be deleted eventually.
An accessing thread may get preempted by a deleting thread before verifying that the node has not been potentially removed (A-3
). In such a case, it is no longer allowed to access the object.
Note that in case the preemption occurs after A-2
, it would even be safe for the accessing thread to access the node (since there was a hazard pointer pointing to the node throughout), but since it is impossible for the accessing thread to distinguish this case, it must fail spuriously.
The important point is that a node will only ever be accessed if it has not been deleted.