setguavaimmutabilitycaliper

Performance of Guava's ImmutableSet.contains


Guava's ImmutableSet seems to perform quite poorly in my benchmark concerning contains. For some sizes it gets even much slower than List:

  size            benchmark         ns linear runtime
100000         ListContains  110279.54 ==
100000          SetContains       7.15 =
100000 ImmutableSetContains   76716.47 =
200000         ListContains  275367.66 =====
200000          SetContains       7.34 =
200000 ImmutableSetContains  322185.50 ======
500000         ListContains  935210.10 ====================
500000          SetContains       7.79 =
500000 ImmutableSetContains 1382765.76 ==============================

Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.

I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.


The solution was to change the smearing function by replacing

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);

by

return C2 * Integer.rotateLeft(hashCode * C1, 15);

This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.


Solution

  • The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1} for some N, which is a power of two. And so do the hashes, because of how Integer.hashCode is defined. The hashes get processed via a function smear identical to this one in order to minimize collision in the lowest bits in some common cases.

    A table of size 2*N gets allocated and the members of M get put into it. As smear involves only right shifts, it maps M onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.

    When contains(o) gets invoked, the search starts in a slot which position is determined by o.hashCode(). If o gets found, the result is true, if an empty slot gets hit, the result is false. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.

    When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N steps. Starting at a random position means N/4 steps on the average.

    What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.