I'm currently going through the "Rust by Practice" exercises. In section "Collection Types > HashMap" there is an example:
use std::collections::HashMap;
fn main() {
let mut map: HashMap<i32, i32> = HashMap::with_capacity(100);
map.insert(1, 2);
map.insert(3, 4);
// Indeed ,the capacity of HashMap is not 100, so we can't compare the equality here.
assert!(map.capacity() >= 100);
println!("Capacity #1: {}", map.capacity());
// Shrinks the capacity of the map with a lower limit. It will drop
// down no lower than the supplied limit while maintaining the internal rules
// and possibly leaving some space in accordance with the resize policy.
map.shrink_to(50);
assert!(map.capacity() >= 50);
println!("Capacity #2: {}", map.capacity());
// Shrinks the capacity of the map as much as possible. It will drop
// down as much as possible while maintaining the internal rules
// and possibly leaving some space in accordance with the resize policy.
map.shrink_to_fit();
assert!(map.capacity() >= 2);
println!("Capacity #3: {}", map.capacity());
println!("Success!");
}
I get the following output running the above program:
Capacity #1: 112
Capacity #2: 56
Capacity #3: 3
Success!
My question is as follows:
We allocate an empty HashMap with a capacity of 100 (elements) to variable map
. Then we insert two elements into said HashMap. So the whole HashMap has a length of 2 elements. So we have 98 places in which we could insert more elements, right? So why does Rust compiler here expand the capacity to 112?
Also when we shrink the capacity of the HashMap to 50, the real capacity seems to be 53, even though we still only have 2 elements inside it.
And the same goes for the third output. Could someone explain to me what's happening here?
We allocate an empty HashMap with a capacity of 100 (elements) to variable map. Then we insert two elements into said HashMap. So the whole HashMap has a length of 2 elements. So we have 98 places in which we could insert more elements, right? So why does Rust compiler here expand the capacity to 112?
If you check the capacity after allocation, you'll see that it's already 112, it's not the insertion which causes the expansion.
But the answer as to why it's that is actually found in the code, it's not really documented.
First, a hashmap can not be 100% full: in order to limit collisions (and the performance degradation which would ensue) it has to reserve "slack space", the amount of which depends on the implementation of the hashmap (and specifically its collision resolution algorithm). The ratio of "usable" hashmap (minus slack space) to "total" hashmap size is called the "load factor" e.g. if a hashmap has a load factor of 75%, it will at most be 75% full before it resizes (expands).
In Rust's hashmap, the load factor is quite high, nearly 90% (specifically 7/8th, or 87.5%). If you apply this to your request (100 / 0.875), you get 114.29, which... is not what we get. Plus HashMap::capacity
does apply the load factor before returning, so if the internal capacity was 1151, it should return 100.
The second bit is that for reasons of computers and bitmaps, hashbrown rounds the "internal" capacity it gets from the load factor to the nearest power of two2, so when you request enough space for 100 items, first it applies the load factor getting 115, then it rounds that up to 128.
When you ask for the capacity()
it applies the load factor to the real internal capacity, 128 / 8 * 7 = 112, and bob's your uncle.
Same story for 50, divide by load factor to 57, round up to 64, multiply by load factor = 56.
And essentially the same for 2, although it's actually special-cased: an internal capacity under 4 is rounded up to 4 (capacity() = 3), and above 4 but below 8 is rounded up to 8 (capacity() = 7).
You can find the exact code in capacity_to_buckets
.
1: can't have fractional cells in an array so the capacity has to be rounded up
2: if the "real" capacity is a power of two, the hashmap can just store the power of two which is at most 64 (for 264), and thus fits into a single byte, otherwise it would need 8 bytes, it also uses a bitmap to store which entries are full / empty, this way the size of the bitmap always matches the actual capacity