javahashhashmaphashtablehashcode

Should hash code formulas change based on the variables ranges?


I have a basic class called Dice that represents groups of dice. The format I use for them is D&D's format of NdX. Where N is the number of dice and X is the number of sides each dice has. EG. 5d6 is a group of 5 dice each with 6 sides.

I haven't done anything with hash codes before, and I thought this would be decent practice. From what I understand, every Dice that is equal using the equals() method must have the same hashcode, and hashcodes should be "evenly distributed". I have found the following way of computing hascodes.

@Override
public int hashCode(){
    return Objects.hash(numSides, numDice);
}
// algo from https://stackoverflow.com/a/18066516/23651145

Both the number of dice and the number of sides on each dice can go up to the integer limit, but in practice they probably won't exceed 10 and 100 respectively.

I am looking for optimal and evenly distributed function. Should I be taking their respective ranges into account when creating a hash code formula? Does the order of the variables matter? How would I test how good a given hash code formula is?


Solution

  • As always, it is a good idea to read the JavaDoc for the method you are trying to implement. It says:

    Returns a hash code value for the object. This method issupported for the benefit of hash tables such as those provided by java.util.HashMap.

    The general contract of hashCode is:

    • Whenever it is invoked on the same object more than once duringan execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
    • If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
    • It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.

    That is, how Dice implements hashCode only matters if we use a HashSet<Dice> or HashMap<Dice, Whatever>.

    Distinct hashCode are not required for correctness. The only benefit of distinct hashCode is that it "may improve the performance of hashtables". To expand on that a little, a HashSet (or HashMap) works by having a table of buckets, and stuffing each object into the bucket that corresponds the object's hashCode. When looking for an object, it can then compute the object's hashCode, and only look through the objects in that bucket, rather all objects in the HashMap, which can improve performance if the number of objects in the bucket is much smaller than the number of objects in the HashMap.

    With that infodump out of the way, let's look at your example. Your Dice class represents "groups of dice". Do you anticipate using a HashSet<Dice> or a HashMap<Dice, Whatever> anywhere? And do you anticipate stuffing many different Dice objects into that Set or Map? And do you anticipate these Set or Map to be used often enough to be a performance bottleneck for your application?

    If you answered no to any of these questions, you don't need distinct hashCodes.

    Still here? You have one of these rare cases where the performance of your HashMaps is actually crucial? In that case, the best thing to do is to measure the performance of the code that interacts with the HashSet, and see whether different implementations make a difference.

    Let's do that for a toy example:

    abstract class Dice {
        int numDice;
        int numSides;
    
        @Override
        public boolean equals(Object d) {
            return d instanceof Dice //
                    && ((Dice)d).numDice == numDice //
                    && ((Dice)d).numSides == numSides;
        }
    }
    
    class StupidDice extends Dice {
        @Override
        public int hashCode() {
            return 42;
        }
    }
    
    class SimpleDice extends Dice {
        @Override
        public int hashCode() {
            return Objects.hash(numDice, numSides);
        }           
    }
    
    class SimpleDice2 extends Dice {
        @Override
        public int hashCode() {
            return Objects.hash(numSides, numDice);
        }           
    }
    
    class TunedDice extends Dice {
        @Override
        public int hashCode() {
            return 101 * numDice + numSides;
        }
    }
    
    class TunedDice2 extends Dice {
        @Override
        public int hashCode() {
            return 11 * numSides + numDice;
        }   
    }
    
    public class HashCodeBenchmark extends Benchmark {
        static <D extends Dice> void b(String name, Supplier<D> constructor) {
            benchmark(name, (iterations) -> {
                var set = new HashSet<D>();
                for (int i = 0; i < iterations; i++) {
                    set.clear();
                    for (int s = 1; s <= 100; s++) {
                        for (int n = 1; n <= 10; n++) {
                            var d = constructor.get();
                            d.numDice = n;
                            d.numSides = s;
                            set.add(d);                     
                        }
                    }
                }
                return set;         
            });
        }
        
        public static void main(String[] args) {
            b("StupidDice", StupidDice::new);
            b("SimpleDice", SimpleDice::new);
            b("SimpleDice2", SimpleDice2::new);
            b("TunedDice", TunedDice::new);
            b("TunedDice2", TunedDice2::new);
        }
    }
    
    class Benchmark {
    
        public static void benchmark(String label, Code code) {
            print(25, label);
            
            try {
                for (int iterations = 1; ; iterations *= 2) { // detect reasonable iteration count and warm up the code under test
                    System.gc(); // clean up previous runs, so we don't benchmark their cleanup
                    long start = System.nanoTime();
                    code.execute(iterations);
                    long duration = System.nanoTime() - start;
                    
                    if (iterations > 1E8 || duration > 1E9) { 
                        print(27, new BigDecimal(duration * 1000 / iterations).movePointLeft(3) + " ns / iteration");
                        return;
                    }
                }
            } catch (Throwable e) {
                throw new RuntimeException(e);
            }
        }
        
        private static void print(int desiredLength, String message) {
            System.out.print(" ".repeat(Math.max(1, desiredLength - message.length())) + message);
        }
        
        @FunctionalInterface
        interface Code {
            /**
             * Executes the code under test.
             * 
             * @param iterations
             *            number of iterations to perform
             * @return any value that requires the entire code to be executed (to
             *         prevent dead code elimination by the just in time compiler)
             * @throws Throwable
             *             if the test could not complete successfully
             */
            Object execute(int iterations);
        }
    }
    

    If I run that, I get:

                   StupidDice 3718934.375 ns / iteration
                   SimpleDice   11111.906 ns / iteration
                  SimpleDice2   16974.716 ns / iteration
                    TunedDice   12781.292 ns / iteration
                   TunedDice2   12310.991 ns / iteration
    

    That is, stuffing 1000 Dice into a Set took a mere 3.718 ms even with the worst possible hashCode() implementation, while reasonable hashCode() implementations took between 0.011 ms and 0.017 ms.

    That is, for many applications, even the worst possible hashCode() implementation is fast enough, and the differences between reasonable implementations are so small that they matter only in the rarest of cases.

    You should therefore prioritize ease of implementation over performance unless you have evidence of it being an actual performance bottleneck.

    Speaking of ease of implementation: In modern Java, we'd likely declare Dice as a record rather than a class to have the compiler implement constructors, equals, hashCode and toString for us:

    record Dice(int numDice, int numSides) {}