Boost.Unordered unordered_flat_map: why does it need sentinel?

I've been going through Boost's unordered_flat_map implementation. Here's an overview:

https://www.boost.org/doc/libs/latest/libs/unordered/doc/html/unordered/structures.html#structures_open_addressing_containers

I also read the source code a little bit, but not fully. The metadata bytes use the value 1 to indicate so called "sentinel" byte. As far as I understand, this sentinel byte is used for iterators to know where to stop. Quoting the doc above:

The value of h_i is:

...

1 to encode a special empty bucket called a sentinel, which is used internally to stop iteration when the container has been fully traversed.

Here "h_i" is the metadata byte I am talking about. Next, from the source code I can see that there is only one sentinel byte in the entire hash map, and it is at the end of the map (which does not necessarily mean the end of the allocated memory). Please do correct me if I'm wrong.

It makes me wonder why we need it to begin with. The hash map has to track the number of buckets or groups of buckets, otherwise calculating a group index from the hash wouldn't be possible.
Therefore we know in advance how many buckets we have, and so we know when iteration should end, without the need of sentinel byte.
And as far as I understand, we still need to inspect all the buckets, or rather their metadata. In order to know whether the bucket holds a value or is empty.

The situation seems similar to strings ending with \0 byte vs keeping track of the string's length (in which case we don't need the \0 marker).

Is the sentinel byte some kind of optimization? If so then what does it optimize?

Am I missing something here? Is there another reason for the sentinel byte? Can we implement the flat map without the sentinel, and with zero being the only special metadata byte?

EDIT: It might be the case that when we grow the table, and add X groups, we end up with container of size 2*X (the unordered_flat_map seems to double its size on grow). Sentinel at the end of the X'th group means that we don't have to traverse the remaining X groups, which means that the iteration is more or less twice as fast. However, this assumes that after growing the table, and after rehashing it we still end up with buckets concentrated in the first half of the table. Why would that be? Doesn't this actually tells us that we use a flawed hashing function? Don't we actually want them to be more or less uniformly distributed in the table? In which case we still need to traverse most, if not all bucket groups. Does it really matter if we skip couple of buckets during the iteration of say thousands of buckets?

Solution

(This is a very detailed explanation of how the hashmap works, by the author Joaquín M López Muñoz, and well worth a read as background, although it doesn't talk about the use of the sentinel in any significant detail.)

The key feature of the hashmap is the use of SIMD instructions to efficiently probe additional metadata to speed-up lookups significantly. The speed-up is used in several ways including:

to assess for a possible key match (match of metadata against a reduced hash)
to find an empty slot (match against the special metadata value 0).

SIMD instructions are also used to speed up iteration. Efficiently iterating over such a hashmap is complicated as the data is sparsely arranged in the backing buffer. (For a start (!), the STL mandates constant time of begin(), which isn't really possible, because begin() has to search for the first element.)

The increment() operation on the iterator (in detail/foa/table.hpp, starting on line 153 in my version of Boost) shows how this works. First the current group is searched for the next non-empty bucket (this is just done linearly; presumably for a short buffer this is more efficient than going to SIMD). If the end of the current group is reached (and the sentinel is not encountered), SIMD instructions are used to efficiently check the next group for a non-zero bucket [note, from the link above, there are no tombstones so this will always distinguish an empty vs truly occupied bucket]. For a very sparsely occupied hashmap this will very efficiently traverse the whole buffer, as each empty group can be very quickly skipped over. By having the sentinel as a non-zero value, this scheme works without having to have additional branching logic to check for the size of the hashmap on each iteration.

Now, permanently reserving a sentinel flag for this does come at a cost -- the trivial cost of wasting one slot, but perhaps more importantly the degradation of the reduced hash (loss of 0.0056bits). Given the attention to detail taken by Joaquin in writing this hashmap, I think it is safe to assume that this was very carefully considered and profiled, so whilst it might be true that the hashmap doesn't need to use a sentinel, it can certainly make use of it.