hashcassandradatastax-enterpriseshamurmurhash

Generate C* bucket hash from multipart primary key


I will have C* tables that will be very wide. To prevent them to become too wide I have encountered a strategy that could suit me well. It was presented in this video. Bucket Your Partitions Wisely

The good thing with this strategy is that there is no need for a "look-up-table" (it is fast), the bad part is that one needs to know the max amount of buckets and eventually end up with no more buckets to use (not scalable). I know my max bucket size so I will try this.

By calculating a hash from the tables primary keys this can be used as a bucket part together with the rest of the primary keys.

I have come up with the following method to be sure (I think?) that the hash always will be the same for a specific primary key.

Using Guava Hashing:

public static String bucket(List<String> primKeyParts, int maxBuckets) {

    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part ->{
        combinedHashString.append(
            String.valueOf(
                Hashing.consistentHash(Hashing.sha512()
                    .hashBytes(part.getBytes()), maxBuckets)
            )
        );
    });
    return combinedHashString.toString();
}

The reason I use sha512 is to be able to have strings with max characters of 256 (512 bit) otherwise the result will never be the same (as it seems according to my tests).

I am far from being a hashing guru, hence I'm asking the following questions.

Requirement: Between different JVM executions on different nodes/machines the result should always be the same for a given Cassandra primary key?

  1. Can I rely on the mentioned method to do the job?
  2. Is there a better solution of hashing large strings so they always will produce the same result for a given string?
  3. Do I always need to hash from string or could there be a better way of doing this for a C* primary key and always produce same result?

Please, I don't want to discuss data modeling for a specific table, I just want to have a bucket strategy.

EDIT:

Elaborated further and came up with this so the length of string can be arbitrary. What do you say about this one?

public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {

    List<HashCode> hashCodes = new ArrayList();
    for(String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    };
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}

Solution

  • I currently use a similar solution in production. So for your method I would change to:

    public static int bucket(List<String> primKeyParts, int maxBuckets) {
      String keyParts = String.join("", primKeyParts);
      return Hashing.consistentHash(
                         Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
                         maxBuckets);
    }
    

    So the differences

    1. Send all the PK parts into the hash function at once.
    2. We actually set the max buckets as a code constant since the consistent hash is only if the max buckets stay the same.
    3. We use MurMur3 hash since we want it to be fast not cryptographically strong.

    For your direct questions 1) Yes the method should do the job. 2) I think with the tweaks above you should be set. 3) The assumption is you need the whole PK?

    I'm not sure you need to use the whole primary key since the expectation is that your partition part of your primary key is gonna be the same for many things which is why you are bucketing. You could just hash the bits that will provide you with good buckets to use in your partition key. In our case we just hash some of the clustering key parts of the PK to generate the bucket id we use as part of the partition key.