gohashprometheus

FNV hashing with sorted input - performance and hash collisions


prometheus/common provides two implementations to compute FNV1a 64-bit hash value of map[string]string.

One includes sorting of labels (map keys) before feeding data as input to hash algorithm:

func LabelsToSignature(labels map[string]string) uint64 {
...
    labelNames := make([]string, 0, len(labels))
    for labelName := range labels {
        labelNames = append(labelNames, labelName)
    }
    sort.Strings(labelNames)
...
}

Another (fast) implementation skips this step and feeds map to hashing algorithm without sorting labels (map keys) first.

Benchmarking shows that using make([]string, 0, len(labels) in regular (not fast) implementation leads to an allocation and hence 2x performance loss.

Authors claim that fast algorithm is more susceptible to hash collisions.

I wonder why sorting map keys ("labels" as per prometheus vocabulary) before feeding map to FNV algorithm produces less hash collisions?


Solution

  • Because it's pretty easy to construct hash collisions for the fast fingerprint, both {"A":"cswpLMIZpwt","_":"ypfajYg2lsv"} and {"A":"K6sjsNNczPl","_":"KiqbryhzUpn"} have the fast fingerprint ca4cfff38144a92f.

    package main
    
    import (
        "encoding/json"
        "fmt"
        "log"
    
        "github.com/prometheus/common/model"
    )
    
    func main() {
        labels := []string{
            `{"A":"cswpLMIZpwt","_":"ypfajYg2lsv"}`,
            `{"A":"K6sjsNNczPl","_":"KiqbryhzUpn"}`,
        }
    
        for _, l := range labels {
            var ls model.LabelSet
            if err := json.Unmarshal([]byte(l), &ls); err != nil {
                log.Fatalln(err)
            }
    
            fmt.Printf("label set: %v, valid: %v\n", ls, ls.Validate() == nil)
            fmt.Printf("- fingerprint: %v, fast fingerprint: %v\n", ls.Fingerprint(), ls.FastFingerprint())
        }
    }
    

    Try it on the Go Playground.

    This is because the fast fingerprint has to calculate a hash that is independent on the order of the objects, and uses therefore xor to combine the individual hashes. So individual collisions are sufficient for the same fingerprint, while this is not true for the regular fingerprint - the hash of an individual item depends on the hash of the previous items.

    Thanks to Peter Štibraný from Grafana Labs for the collisions.