Guava's ImmutableSet
seems to perform quite poorly in my benchmark concerning contains
. For some sizes it gets even much slower than List
:
size benchmark ns linear runtime
100000 ListContains 110279.54 ==
100000 SetContains 7.15 =
100000 ImmutableSetContains 76716.47 =
200000 ListContains 275367.66 =====
200000 SetContains 7.34 =
200000 ImmutableSetContains 322185.50 ======
500000 ListContains 935210.10 ====================
500000 SetContains 7.79 =
500000 ImmutableSetContains 1382765.76 ==============================
Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.
I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.
The solution was to change the smearing function by replacing
hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);
by
return C2 * Integer.rotateLeft(hashCode * C1, 15);
This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.
The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1}
for some N
, which is a power of two. And so do the hashes, because of how Integer.hashCode
is defined. The hashes get processed via a function smear
identical to this one in order to minimize collision in the lowest bits in some common cases.
A table of size 2*N
gets allocated and the members of M
get put into it. As smear
involves only right shifts, it maps M
onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.
When contains(o)
gets invoked, the search starts in a slot which position is determined by o.hashCode()
. If o
gets found, the result is true
, if an empty slot gets hit, the result is false
. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.
When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N
steps. Starting at a random position means N/4
steps on the average.
What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.