c++algorithm lookup perfect-hash universal-hashing

Ultra fast lookup in small sized container of 64-bit integer values using dynamic perfect hashing

I have input keys that can cheaply be converted to a uint64_t (ie, the input contains less than or equal to 64 bits).

Each unique key (not yet in the map) will be assigned some data (a pointer to an object). Much like a std::map<uint64_t, Object*> thus.

Inserting a new key is not time critical because there will only be a small number of total keys, which are never removed again. Where small is less than 100.

A typical implementation would use a std::vector (because of the small number of elements) and either just scan over all elements, or use a binary search (ie boost::flat_map); but this is not optimal enough for me. Once all elements have been inserted the map is static and it should be possible to find the pointer that belongs to a given key in a mere few clock cycles.

What I am thinking of is determining a perfect hash every time a new key is inserted and then use this hash (function) to find back the pointers.

This requires two things:

An algorithm to find a cheap hash function that converts a smallish list of given 64-bit values to 8, 9, 10 (or how many is required) bits with no collisions (aka, a perfect hash function), so that the latter can be used directly in a lookup table (that is 256, 512, 1024... in size).
A method to dynamically create such a hash function; where I have to admit that I am not willing to run an external compiler and load the new hash function dynamically ;-).

Solution

PART 3

I finished my implementation of UltraHash. The result is better than I had expected possible - and indeed much much better than the "multiplication hash" approach.

Conversion of a 64-bit key --> lookup table index takes 67 clock cycles (18 ns).
For 100 keys, the lookup table index isn't 10 bits, or 9, or 8... but 7 bits! And if you have less than 64 keys chances are you'll get even 6 bits. For 200 keys, the returned index is still only 8 bits!
To initialize the UltraHash object with the keys, doesn't take 0.3 seconds - no.. it takes 1 to 3 ms!

The amount of code involved is too much to paste here (I think? ~800 lines). A lot of those lines are comments though. I did explain into excruciating detail what the code is doing and how the algorithm works. If anyone is interested and wants it to be included here, let me know and I'll put some more time it doing that.

The main idea is as follows:

The n keys are 64 bits each - we can see the whole as a matrix of of n x 64. Each of the 64 columns is examined for randomness (extremely rudimentary) after which they are sorted in order of likeliness to contribute to an evenly distributed set of keys of 32 to 64 keys per set. Then by brute force a combination of columns is picked, resulting in partitioning of the keys into said sets where for each set a perfect hash is attempted to be found. Upon failure the number of columns is increased.

Finding the perfect hash for a set of 32 to 64 keys is done by means of linear algebra over GF(2), which involves Gaussian elimination etc.

You can find the code here and the accompanied header here.

Since the code is part of my utils git submodule, it uses other files from that library, as well as debugging code from cwds, but it should be easy to make it work standalone. The project ai-utils-testsuite is a configurable and compile-able project that contains benchmark and test code.

Below follows the comments that precede one of the functions in UltraHash.cxx:

// This function checks if the n given keys (in K) are linear independent and
// if they are - fills in Mᵀ with the magic numbers to convert the keys to a
// number in the range [0, 64> (64 --> 6 bits); the perfect hash.
//
// This involves solving a linear (matrix) equation of the form:
//
// K M = L
//
// Where each row of K is a 64-bit key, M is to be solved, and L exists
// of rows that are - in fact - the values returned by UtraHash::index:
// 6 bit numbers.
//
// For example, let K be a 3x5 matrix (3 keys of 5 bits):
//
//      Column 0 (lsb)
//         |
//         v
//       ⎡ 1 1 0 1 1 ⎤  <-- row 0 = key 1
//   K = ⎢ 1 1 1 0 1 ⎥              key 2
//       ⎣ 1 1 1 1 0 ⎦              key 3
//
// Note that column 0 is stored in the least significant bit of the uint64_t,
// the uint64_t values of the keys are therefore 27, 23 and 15, in this example.
//
// In order to visualize the relationship with the std::array<uint64_t, sz>
// variables we'll put column 0 on the right, but then also invert the rows
// in order to keep an indentity matrix visually unchanged (rotate the whole
// thing 180 degrees).
//
//              Column 0 (lsb)
//                 |
//                 v
//       ⎡ 0 1 1 1 1 ⎤             K[2] = 15
//   K = ⎢ 1 0 1 1 1 ⎥             K[1] = 23
//       ⎣ 1 1 0 1 1 ⎦  <-- row 0, K[0] = 27
//
// Furthermore, let L be such that it simply contains each row number in order:
//
//        Column 0 (lsb)
//           |
//           v
//       ⎡ 1 0 ⎤      row 2: 2
//   L = ⎢ 0 1 ⎥      row 1: 1
//       ⎣ 0 0 ⎦  <-- row 0: 0
//
// Note that values in the rows of L don't really matter, as long as they
// are all different.
//
// The matrix equation K M = L reads, in this example,
//
//                 ⎡ z e ⎤ <-- msb
//   ⎡ 0 1 1 1 1 ⎤ ⎢ y d ⎥   ⎡ 1 0 ⎤
//   ⎢ 1 0 1 1 1 ⎥ ⎢ x c ⎥ = ⎢ 0 1 ⎥
//   ⎣ 1 1 0 1 1 ⎦ ⎢ w b ⎥   ⎣ 0 0 ⎦
//                 ⎣ v a ⎦
//                     ^
//                     |
//                  index 0
//
// where edcba is the binary representation of the value in the first position of MT, etc.
// Namely, if we transpose M, we get:
//
//                lsb (column 0 of Mᵀ; row 0 of M).
//                 |
//                 v
//        ⎡z y x w v ⎤
//   Mᵀ = ⎣e d c b a ⎦ <-- MT[0] (row 0 of Mᵀ / column 0 of M).
//
// The matrix equation remains the same if we substract row i from row j (i != j)
// in both K and L, provided all rows in both matrices are different.
//
// If the keys are linearly independent we'll be able to put K
// in row echelon form (K') where each next row has more zeroes in
// the leading columns (or since we put the matrix upside down,
// in the trailing colums):
//
// For example, subtracting (XOR-ing) the bottom row (row 0) from row 1 and 2,
// and then substracting row 1 from row 2, gives
//
//                     ⎡ z e ⎤
//       ⎡ 1 1 0 0 0 ⎤ ⎢ y d ⎥   ⎡ 1 1 ⎤
// K' =  ⎢ 0 1 1 0 0 ⎥ ⎢ x c ⎥ = ⎢ 0 1 ⎥
//       ⎣ 1 1 0 1 1 ⎦ ⎢ w b ⎥   ⎣ 0 0 ⎦
//                     ⎣ v a ⎦
//
// Where each row (starting at the bottom) has more zeroes on the
// right than the previous row (the row below it).
//
// This already proves that the keys were linearly independent,
// because there is no row that has all zeroes in K.
// Note that even if there was a row in K with all zeroes, we might
// still be able to find a solution if also L has all zeroes in that row.
//
// The above is corresponding to the set of equations:
//                         ⎡ 1 1 0 ⎤
//                    Lᵀ = ⎣ 1 0 0 ⎦
//    1 1 0 0 0
//    e+d                =   1
//    z+y                =   1
//
//    0 1 1 0 0
//      d+c              =     1
//      y+x              =     0
//
//    1 1 0 1 1
//    e+d  +b+a          =       0
//    z+y  +w+v          =       0
//
// and from top to bottom we want to
//
// 1) set d to 1 (keep e at 0).
//    set y to 1 (keep z at 0).
// 2) set c to 1 + d.
//    set x to 0 + y.
// 3) set a to 0 + e + d (keep b at 0).
//    set v to 0 + z + y (keep w at 0).
//
// It is easy to verify that this answer also holds in the original
// equation:
//                     1 0 <---- 1 0 <-- column m.
//   .-- row n         v v       | |    .-- row n.
//   v               ⎡ 0 0 ⎤     v v    v
//   2 ⎡ 0 1 1 1 1 ⎤ ⎢ 1 1 ⎥   ⎡ 1 0 ⎤  2
//   1 ⎢ 1 0 1 1 1 ⎥ ⎢ 1 0 ⎥ = ⎢ 0 1 ⎥  1
//   0 ⎣ 1 1 0 1 1 ⎦ ⎢ 0 0 ⎥   ⎣ 0 0 ⎦  0
//                   ⎣ 1 1 ⎦
//
// That is, the bit in L at row n and column m is the parity of
// the bit-wise AND between the key at row n in K and column m of M.
//
// Finally we transpose this result to get the output variable MT:
//
//        ⎡ 0 1 1 0 1 ⎤
//   Mᵀ = ⎣ 0 1 0 0 1 ⎦
//
// Since a zero key always results in a zero in L, we can not use
// the value zero in L when one of the keys in zero. Instead we
// ignore the key and use different and non-zero values in L.
//
// For example, if we had the same keys as in the above example,
// but ALSO an all zeroes key:
//
//       ⎡ 1 1 0 1 1 ⎤  <-- row 0 = key 1
//       ⎢ 0 0 0 0 0 ⎥              key 2
//   K = ⎢ 0 1 1 0 0 ⎥              key 3
//       ⎣ 0 1 1 1 1 ⎦              key 4
//
// Then the zero key is removed and we use for L the matrix
//
//       ⎡ 1 1 ⎤
//   L = ⎢ 1 0 ⎥
//       ⎣ 0 1 ⎦
//
// The algorithm to solve M from K' in row echelon form is, as you can see
// from the 1) 2) 3) steps above, done per whole row of M:
//
// First, we start with M being all zeroes:
//
// n=5      row
// ⎡ 0 0 ⎤  4
// ⎢ 0 0 ⎥  3
// ⎢ 0 0 ⎥  2
// ⎢ 0 0 ⎥  1
// ⎣ 0 0 ⎦  0
//
// We skip over row 4, leaving it all zeroes, because the first row of
// K' is 1 1 0 0 0
//       4 3 <-- start with row 3.
//         ^-- the last 1.
// And that row is made equal to the first row of L: 1 1.
// The result after step 1) is therefore:
//
// n=5      row
// ⎡ 0 0 ⎤  4
// ⎢ 1 1 ⎥  3   <-- last updated.
// ⎢ 0 0 ⎥  2
// ⎢ 0 0 ⎥  1
// ⎣ 0 0 ⎦  0
//
// The next row of K' is 0 1 1 0 0
//                       4 3 2 <-- continue with row 2.
//                           ^-- the last 1.
// And that row is made equal to L's second row [ 0 1 ]
// XOR the current row of K' times M so far:
//
//               ⎡ 0 0 ⎤
//               ⎢ 1 1 ⎥
// [ 0 1 1 0 0 ] ⎢ 0 0 ⎥ = [ 1 1 ]
//               ⎢ 0 0 ⎥
//               ⎣ 0 0 ⎦
//
// [ 0 1 ] + [ 1 1 ] = [ 1 0 ]
//
// So that after step 2) M has become:
//
// n=5      row
// ⎡ 0 0 ⎤  4
// ⎢ 1 1 ⎥  3
// ⎢ 1 0 ⎥  2   <-- last updated.
// ⎢ 0 0 ⎥  1
// ⎣ 0 0 ⎦  0
//
// The final row of K' is 1 1 0 1 1
//                        4 3 2 1 0 <-- continue with row 0.
//                                ^-- the last 1.
// The next row of L is [ 0 0 ] and
//
//                         ⎡ 0 0 ⎤
//                         ⎢ 1 1 ⎥
// [ 0 0 ] + [ 1 1 0 1 1 ] ⎢ 1 0 ⎥ = [ 1 1 ]
//                         ⎢ 0 0 ⎥
//                         ⎣ 0 0 ⎦
//
// So that after step 3) M has become:
//
// n=5      row
// ⎡ 0 0 ⎤  4
// ⎢ 1 1 ⎥  3
// ⎢ 1 0 ⎥  2
// ⎢ 0 0 ⎥  1
// ⎣ 1 1 ⎦  0   <-- last updated.
//

And trust me, that's the longest comment I ever wrote for one function :).

ORIGINAL POST:

The random multiplication factor approach

I tried the same approach of samgak: my hash function was just a multiplication of the uint64_t key with a uint64_t multiplication factor, and then I tried random values for the multiplication factor until the least number of high-bits of the result were different. It turns out that this takes easily up till 0.3 seconds to reach a 9 bit output.

Here I used the following function to find the number of high bits required:

// Return the maximal right-shift that is possible on
// the values in `hashes` such that the results are still all different.
int max_shift(std::vector<uint32_t> hashes)
{
  std::sort(hashes.begin(), hashes.end());
  int sm = 0;
  int sz = hashes.size();
  for (int i = 1; i < sz; ++i)
  {
    uint32_t d = hashes[i - 1] ^ hashes[i];
    int s = std::countl_zero(d);
    if (s > sm)
      sm = s;
  }
  return 31 - sm;
}

The hashes here are, for example, just the most significant 32-bit of the result of the multiplication; the lower 32 bit would be shifted out / ignored anyway.

Considering that 100 values, being less than 128, theoretically fit into 7 bits (although I never even found 8 bits with the above approach; which isn't weird when you realize that the chance for a random attempt corresponds with the Birthday Problem with 100 people and 256 birthdays; which has a 1 in 5807421181 chance of having no collisions) and that I found that waiting up to 0.3 seconds, and theoretically much longer, was a bit annoying -- I started to think about a way to calculate the hash function.

In order to be able to do any calculations, I decided to use a linear algebra approach.

Using linear algebra

The idea is to use linear algebra (matrices, vectors). And since we are working with arbitrary bits, the most logical thing to do is to work over $\mathbb{Z}_{2}$ . Ok, this external conversion of LaTeX to images isn't working, I'll use UTF8: ℤ₂

Let all the input keys be represented by 64-dimensional column vectors over ℤ₂: 𝐕ᵢ.

Note there can exist at most 64 such linearly independent vectors. Assume for a moment that we have 64 such keys. Then we can construct the matrix

K = [𝐕₀ 𝐕₁ … 𝐕ᵢ … 𝐕₆₃]

and calculate the inverse of that (because of 𝐕ᵢ are linearly independent). Lets denote this inverse with K⁻¹.

Further more, note that we need 6 bits to enumerate 64 input keys. Now let the 6×64 matrix L be the matrix that exist of all 6-dimensional column vectors:

    ⎡ 0 0 0 0 0     1 1 ⎤
    ⎢ 0 0 0 0 0     1 1 ⎥
L = ⎢ 0 0 0 0 0 ... 1 1 ⎥
    ⎢ 0 0 0 0 1     1 1 ⎥
    ⎢ 0 0 1 1 0     1 1 ⎥
    ⎣ 0 1 0 1 0     0 1 ⎦

Let the 6×64 matrix M then be

M = L K⁻¹

Then

M K = L K⁻¹ K = L

and left-multiplying any input key 𝐕ᵢ with M will give a unique 6-bit column vector; aka - the perfect hash.

Linearly dependent keys

The main problem with this approach is of course that the input keys might not be linearly independent, which is certainly going to be the case when there are more than 64 keys.

In order to detect if/when keys are not linearly independent, we can simply triangulate the matrix K when new keys come in. In order to give examples, lets assume that we have vectors/keys of 4 bits instead of 64.

Consider the following five distinct input keys:

   ⎛0⎞     ⎛1⎞     ⎛1⎞     ⎛0⎞     ⎛1⎞
V₀=⎜1⎟, V₁=⎜0⎟, V₂=⎜0⎟, V₃=⎜0⎟, V₄=⎜1⎟
   ⎜1⎟     ⎜1⎟     ⎜0⎟     ⎜1⎟     ⎜0⎟
   ⎝0⎠     ⎝1⎠     ⎝1⎠     ⎝0⎠     ⎝1⎠

Then we start with a identity matrix (all ones on the main diagonal and the reset zeroes), and put the first key such that the top 1 aligns with a 1 on the matrix diagonal:

    ⎡ 0 0 0 1 ⎤
K'= ⎢ 0 0 1 0 ⎥
    ⎢ 0 1 1 0 ⎥
    ⎣ 1 0 0 0 ⎦
          *     <-- used

As a result, this matrix has an inverse, because the upper-left triangle is all zeroes and the main diagonal is all ones.

Next key, V₁, can likewise as-is be put in this matrix:

    ⎡ 0 0 0 1 ⎤
K'= ⎢ 0 0 1 0 ⎥
    ⎢ 0 1 1 1 ⎥
    ⎣ 1 0 0 1 ⎦
          * *   <-- used

Now column four is used, so we can't use it to put in V₂ (that also has a 1 at the top). Therefore, we subtract the fourth column from V₂ first (note that subtraction over ℤ₂ is addition modulo 2, aka an XOR operation). V₂ then becomes:

⎛1⎞   ⎛1⎞  ⎛0⎞
⎜0⎟ - ⎜0⎟ =⎜0⎟
⎜0⎟   ⎜1⎟  ⎜1⎟
⎝1⎠   ⎝1⎠  ⎝0⎠

and we can put that in column 2 of the matrix (which doesn't change it). V₃ now no longer can be put in column 2, because it is "used".

If we try to use the same trick, subtracting column 2 from V₃ we get

⎛0⎞   ⎛0⎞  ⎛0⎞
⎜0⎟ - ⎜0⎟ =⎜0⎟
⎜1⎟   ⎜1⎟  ⎜0⎟
⎝0⎠   ⎝0⎠  ⎝0⎠

all zeroes. Which indicates that the first four keys were linearly dependent. We can mitigate this by extending the vector with a 1 at the bottom (and all the others with zeroes):

   ⎛0⎞     ⎛1⎞     ⎛1⎞     ⎛0⎞     ⎛1⎞
V₀=⎜1⎟, V₁=⎜0⎟, V₂=⎜0⎟, V₃=⎜0⎟, V₄=⎜1⎟
   ⎜1⎟     ⎜1⎟     ⎜0⎟     ⎜1⎟     ⎜0⎟
   ⎜0⎟     ⎜1⎟     ⎜1⎟     ⎜0⎟     ⎜1⎟
   ⎝0⎠     ⎝0⎠     ⎝0⎠     ⎝1⎠     ⎝0⎠

If we would repeat the same thing with these inputs, we get the exact same matrix, but with an extra 1 in the bottom left:

    ⎡ 0 0 0 0 1 ⎤
K'= ⎢ 0 0 0 1 0 ⎥
    ⎢ 0 0 1 1 1 ⎥
    ⎢ 0 1 0 0 1 ⎥
    ⎣ 1 0 0 0 0 ⎦
      *   * * *   <-- used

V₄ turns out to be linear dependent as well: it has a 1 at the top, thus subtract the last column, and then we get the (now) fourth column which was already used. Therefore we have to add another row:

   ⎛0⎞     ⎛1⎞     ⎛1⎞     ⎛0⎞     ⎛1⎞
V₀=⎜1⎟, V₁=⎜0⎟, V₂=⎜0⎟, V₃=⎜0⎟, V₄=⎜1⎟
   ⎜1⎟     ⎜1⎟     ⎜0⎟     ⎜1⎟     ⎜0⎟
   ⎜0⎟     ⎜1⎟     ⎜1⎟     ⎜0⎟     ⎜1⎟
   ⎜0⎟     ⎜0⎟     ⎜0⎟     ⎜1⎟     ⎜0⎟
   ⎝0⎠     ⎝0⎠     ⎝0⎠     ⎝0⎠     ⎝1⎠

and

    ⎡ 0 0 0 0 0 1 ⎤
K'= ⎢ 0 0 0 0 1 0 ⎥
    ⎢ 0 0 0 1 1 1 ⎥
    ⎢ 0 0 1 0 0 1 ⎥
    ⎢ 0 1 0 0 0 0 ⎥
    ⎣ 1 0 0 0 0 0 ⎦
      * *   * * *   <-- used

Now lets construct K as described above (just put all the V's in it):

    ⎡ 0 1 1 . 0 1 ⎤
    ⎢ 1 0 0 . 0 1 ⎥
K = ⎢ 1 1 0 . 1 0 ⎥
    ⎢ 0 1 1 . 0 1 ⎥
    ⎢ 0 0 0 . 1 0 ⎥
    ⎣ 0 0 0 . 0 1 ⎦

I left the fourth column empty because we could only find three linearly independent keys even though the (original) keys are 4 bits. Aka, if the keys were 64 bits and we could only find 62 that were linearly independent, we would leave 2 columns empty. That is ... we'll fill them in with pseudo keys that ARE linearly independent with the previous ones; that will lead those columns having their lower bits zero:

    ⎡ 0 1 1 . 0 1 ⎤
    ⎢ 1 0 0 . 0 1 ⎥
K = ⎢ 1 1 0 . 1 0 ⎥
    ⎢ 0 1 1 . 0 1 ⎥
    ⎢ 0 0 0 0 1 0 ⎥
    ⎣ 0 0 0 0 0 1 ⎦

Moreover, the missing linearly independent keys are trivially the unused column(s) of K', so that K can be completed by extending the diagonal of ones from the bottom right to the top left:

    ⎡ 0 1 1 0 0 1 ⎤
    ⎢ 1 0 0 0 0 1 ⎥
K = ⎢ 1 1 0 0 1 0 ⎥
    ⎢ 0 1 1 1 0 1 ⎥
    ⎢ 0 0 0 0 1 0 ⎥
    ⎣ 0 0 0 0 0 1 ⎦

Because now all columns are linearly independent, this matrix has an inverse, in this case that is

    ⎡ 0 1 0 0 0 1 ⎤
    ⎢ 0 1 1 0 1 1 ⎥
K⁻¹=⎢ 1 1 1 0 1 0 ⎥
    ⎢ 1 0 0 1 0 0 ⎥
    ⎢ 0 0 0 0 1 0 ⎥
    ⎣ 0 0 0 0 0 1 ⎦

Originally we had 5 input keys, but even with the extra pseudo key we can enumerate that with 3 bits. However, lets take into account we're not really interested in what the pseudo key results in - so we use the following L:

    ⎡ 0 0 0 ? 0 1 ⎤
L = ⎢ 0 0 1 ? 1 0 ⎥
    ⎣ 0 1 0 ? 1 0 ⎦

where you can fill in whatever you want in column four. If you fill in the next number (5):

    ⎡ 0 0 0 1 0 1 ⎤
L = ⎢ 0 0 1 0 1 0 ⎥
    ⎣ 0 1 0 1 1 0 ⎦

then

            ⎡ 1 0 0 1 0 1 ⎤
M = L K⁻¹ = ⎢ 1 1 1 0 0 0 ⎥
            ⎣ 1 1 1 1 0 1 ⎦

Constructing a Kₒ with the original keys (V₀, V₁, V₂, V₃ and V₄):

   ⎡ 0 1 1 0 1 ⎤
Kₒ=⎢ 1 0 0 0 1 ⎥
   ⎢ 1 1 0 1 0 ⎥
   ⎣ 0 1 1 0 1 ⎦

and using M without the last two columns,

    ⎡ 1 0 0 1 ⎤
M = ⎢ 1 1 1 0 ⎥
    ⎣ 1 1 1 1 ⎦

we get

       ⎡ 0 0 0 0 0 ⎤
M Kₒ = ⎢ 0 0 1 1 0 ⎥
       ⎣ 0 1 0 1 1 ⎦

showing that multiplying with this matrix M is not a perfect hash: the last column is the same as the second one.

The exact change is obvious: if we had used the full matrix (including the last two columns) and the extended keys then only the last key has a 1 in the last column and is therefore XOR-ed with the last column of the matrix:

            ⮶what we get
  1     1   0
  0 XOR 0 = 0
  0     1   1
  ⮴expected

Thus, the only way to correct the output is by knowing in which column of K it belongs (the last column) based on the first four bits of that column, which is much like the original problem we try to solve to begin with.

PART 2

The trick is to divide the keys into sets of 64 or less linearly independent keys. This is easy because for fully random 64 bit values, on average the first 62.18 keys will be linearly independent, which is close enough to 64 to be efficient.

The keys should be well hashed however, for example if the 64-bit keys would be uint64_t with values between 0 till 64, only using 6 bits, then you'd obviously only find 6 linearly independent keys at a time and would at least need 64 / 6 = 11 sets.

This would slow down the look up because you will have to be able to determine in which set a key belongs before you can multiply it with the correct matrix.

One way to do this is to sort the keys and remember the values of the first key in the next set (the first key that is linear dependent on the keys in the previous set).

For example, lets say we have three sets:

S₀={V₀,V₁,V₂}, S₁={V₃,V₄} and S₂={V₅,V₆,V₇,V₈}

because V₃ can be expressed in terms of the keys in S₀ (e.g. V₀ + V₂) and V₅ could be expressed as V₃ + V₄.

Lets try to think of an example where this is the case. Remember that we sorted the keys, so V₀<V₁<V₂<V₃<V₄<V₅<V₆<V₇<V₈ when represented as numbers. Lets say that the least significant bit of the unsigned integers that they are stored in correspond with row 0 of the vectors, then possible values could be:

{ 0b00001, 0b00010, 0b00100 }, { 0b00101, 0b01000 }, { 0b01101, 0b01111, 0b11001, 0b11110 }

   ⎛0⎞     ⎛0⎞     ⎛0⎞     ⎛0⎞     ⎛0⎞     ⎛0⎞     ⎛0⎞    ⎛1⎞     ⎛1⎞
   ⎜0⎟     ⎜0⎟     ⎜0⎟     ⎜0⎟     ⎜1⎟     ⎜1⎟     ⎜1⎟    ⎜1⎟     ⎜1⎟
V₀=⎜0⎟, V₁=⎜0⎟, V₂=⎜1⎟, V₃=⎜1⎟, V₄=⎜0⎟, V₅=⎜1⎟, V₆=⎜1⎟,V₇=⎜0⎟, V₈=⎜1⎟
   ⎜0⎟     ⎜1⎟     ⎜0⎟     ⎜0⎟     ⎜0⎟     ⎜0⎟     ⎜1⎟    ⎜0⎟     ⎜1⎟
   ⎝1⎠     ⎝0⎠     ⎝0⎠     ⎝1⎠     ⎝0⎠     ⎝1⎠     ⎝1⎠    ⎝1⎠     ⎝0⎠

so that indeed V₃=V₀+V₂ and V₅=V₃+V₄.

But we can do better! There is no need to sort the keys and use < to specify the boundaries between sets. A very effective way would be to use a binary tree to find the correct set where each time one particular bit is used.

Note that with an expectation of ~100 keys and sets with a size of ~62 - we expect to have only two sets! So only a single bit to switch set on. This bit can be found by trial and error. In the above example we can create the following two sets of linearly independent vectors:

S₀={V₁,V₂,V₄,V₈} and S₁={V₀,V₃,V₅,V₆,V₇}

by simply switching on the least significant bit.

The total lookup code for one hundred 64-bit keys will then exist of testing 1 bit, that determines which table of six 64-bit values to use, and then six XOR operations (that can be done in parallel in theory; so that should be around 3 clock cycles) and likewise 6 parity tests - which I'll count as 6 clock cycles (see this post for more on parity). Which brings me to a dozen clock cycles or so. And the result would be 7 bit, not 9 bit! Also, calculating these matrices can be done much much faster than 0.3 seconds.

I'll post an update when I have working code.