gohashhash-collisiongob

Hashing multiple values in golang


I'm currently working on an application in go that needs to cache different resources. Different types of resources have handlers that will know what data is relevant to determine, if we have to rebuild a resource or if we can fetch it from cache. For this purpose handlers shall generate a hash of all relevant data for caching. Depending on the context that data can be primitives (int, float, ...), strings, slices, structs and maps. So almost everything. The number of objects used for hashing might also vary.

To calculate that hash in handlers, I've created a hashing function with variadic parameters of type interface{}.

My current approach is this:

func Hash(objs ...interface{})([]byte) {
    // Use MD5 because it's fast and is reasonably enough protected against accidental collisions.
    // There is no scenario here where intentional created collisions could do harm.
    digester := crypto.MD5.New()

    encoder := gob.NewEncoder(digester)
    encoder.Encode(objs) // In real life one would handle that error

    return digester.Sum(make([]byte, 0))
}

This works. But there are a few things that give me headaches with this implementation. Because I'm not sure that gob will always behave deterministic, for the current version that seems to be the case, but as the referenced answer points out, there might be changes between versions. According to the documentation for gob, default values (0 for ints, empty strings, nil, ...) will be omitted when transporting structs. Also all int values will be transmitted as generic numbers. So unit64 and int will be the same. I can't think of an actual issue with this for my usecase, but this smells like a source for trouble.

Now if I would write that function from scratch I would properly play it safe, traverse the structure with reflection and create a hashing tree. But I don't want to do that.

I'm pretty sure I'm not the first one out there with these requirements, but I wasn't able to find any well tested go code on the web, that addresses this issue.

Appendix

Also see: https://crypto.stackexchange.com/questions/10058/how-to-hash-a-list-of-multiple-items

This is not as trivial as it may seem. Simply concatenating data as pointed out by Adrian won't work, because then Hash("12", "3") and Hash("123") will generate the same hash. One possible solution is to prepend the length of data (and the number of elements of lists) before hashing or to create a tree of hashes, but I can't think of a reliable way to do either of that with complex data structures, without writing a lot of reflection code that handles all the different cases (integers, floats, strings, structs, slices, ...) and walks through the entire structure. I want to avoid that, because there are a lot of special cases that can be overseen doing so and it seems unnecessary complicated to me. This is why I tried to solve the problem using serialization, which will take care of most of the problems described above. I'm just not sure if there are not some drawbacks of using gob in this scenario and if there isn't a smarter solution to the problem.


Solution

  • Perhaps to add to add to Adrian's answer. If you add fmt.Fprintf(digester, reflect.TypeOf(ob)) before each object. That would make Hash("12", "3") and Hash("123") different.

    package main

    import (
        "crypto"
        _ "crypto/md5"
        "fmt"
        "reflect"
    )
    
    func main() {
        fmt.Printf("%x\n", Hash("12", "3"))
        fmt.Printf("%x\n", Hash("123"))
    }
    
    func Hash(objs ...interface{}) []byte {
        digester := crypto.MD5.New()
        for _, ob := range objs {
            fmt.Fprint(digester, reflect.TypeOf(ob))
            fmt.Fprint(digester, ob)
        }
        return digester.Sum(nil)
    }
    

    Output:

    c3d5dcf1d7540d3e46e7d7b5a8c6e8ae
    787ca7e12a2fa991cea5051a64b49d0c
    

    https://play.golang.org/p/nufD3wTJkb