Tags: databricks, partitioning, hilbert-curve

Implications of the Hilbert Curve for Liquid Clustering


I have learned that Databricks' new Liquid Clustering functionality uses a Hilbert curve to place records into the underlying Delta (Parquet) data files.

I'm guessing that the columns you select for liquid clustering serve as the inputs to this algorithm, and the choice of file is its output.

However, I don't really understand the implications, or how this differs from typical database index or primary-key algorithms, which I think are more like trees. I was actually told by somebody working for Databricks that this algorithm also uses a tree, but that doesn't really help me understand.

What are the implications here? For example, say you liquid cluster by 4 columns. The first 3 columns are basically independent variables, and the 4th column is dependent on them because it is a hash of them. In a typical database, I think this makes the 4th column redundant: by the time you travel that far down the tree, you've already taken advantage of the information it contains, so you should remove that column from the index to save space.

However, I was told that it could still be valuable to include the 4th column since a Hilbert curve is being used. Is this correct? Can anybody explain this and add some color?

EDIT: It is possible that I am somewhat mistaken, and that the Hilbert curve is used after a more traditional tree, as with ZORDER?


Solution

  • What actually is liquid clustering? Conceptually, liquid clustering is similar to SQL Server's clustered index: it physically organizes data so that adjacent records (records in the same Parquet file) have similar values of the clustering columns.

    Technically, it uses Hilbert curves to achieve this. It does not use z-curves, nor does it have anything to do with trees.
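
    To build intuition, here is a minimal sketch in plain Python of the idea: map each multi-column key to a single position along a Hilbert curve, sort the rows by that position, and chunk the sorted rows into files. This is a hypothetical illustration using the classic two-dimensional Hilbert mapping, not Databricks' actual implementation:

      import itertools

      def hilbert_index(order, x, y):
          """Map an (x, y) point on a 2**order by 2**order grid to its
          position along the Hilbert curve (classic bit-twiddling form)."""
          d = 0
          s = 2 ** (order - 1)
          while s > 0:
              rx = 1 if (x & s) > 0 else 0
              ry = 1 if (y & s) > 0 else 0
              d += s * s * ((3 * rx) ^ ry)
              # Rotate/reflect the quadrant so the curve stays contiguous.
              if ry == 0:
                  if rx == 1:
                      x = s - 1 - x
                      y = s - 1 - y
                  x, y = y, x
              s //= 2
          return d

      # Toy "table": every (x, y) combination on a 4x4 grid.
      records = list(itertools.product(range(4), range(4)))

      # Sort by curve position and chunk into 4 "files" of 4 rows each,
      # mimicking how sorted runs of rows are assigned to data files.
      ordered = sorted(records, key=lambda r: hilbert_index(2, *r))
      files = [ordered[i:i + 4] for i in range(0, 16, 4)]
      for i, chunk in enumerate(files, 1):
          print(f"file {i}: {chunk}")

    Each resulting "file" ends up covering one 2x2 quadrant of the grid: records that are close in (x, y) space land in the same file, which is the locality property that file pruning relies on.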

    Should you include the column in the clustering keys? Yes, if the column is frequently used in filters.
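
    For reference, clustering keys are declared on the table itself. A minimal sketch, assuming a Databricks notebook where spark is the provided SparkSession; the table and column names are made up:

      # CLUSTER BY is the Databricks SQL clause that declares liquid
      # clustering keys. The hash column is included because queries
      # filter on it directly.
      spark.sql("""
          CREATE TABLE events (
              key1 BIGINT,
              key2 BIGINT,
              key3 BIGINT,
              key_hash STRING
          )
          CLUSTER BY (key1, key2, key3, key_hash)
      """)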

    In your example, an index on the 3 columns is not useful if you search by hash: you would first need to obtain the 3 keys used as input to the hash function. Including the hash as the 4th column will organize the data by hash values and help prune data files when you look for a specific hash value.
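
    The pruning works through per-file min/max statistics: with the hash in the clustering key, each data file covers a narrow range of hash values, and files whose range cannot contain the searched value are skipped. A toy Python sketch of that file-skipping logic (the file names and statistics are invented; real Delta file statistics are richer):

      # Toy per-file statistics, as if rows were clustered by the hash:
      # each "file" covers a narrow, non-overlapping range of hash values.
      file_stats = [
          {"file": "part-0001", "hash_min": 0x00000000, "hash_max": 0x3FFFFFFF},
          {"file": "part-0002", "hash_min": 0x40000000, "hash_max": 0x7FFFFFFF},
          {"file": "part-0003", "hash_min": 0x80000000, "hash_max": 0xBFFFFFFF},
          {"file": "part-0004", "hash_min": 0xC0000000, "hash_max": 0xFFFFFFFF},
      ]

      def files_to_scan(stats, wanted):
          """Keep only the files whose [min, max] range can contain the value."""
          return [s["file"] for s in stats
                  if s["hash_min"] <= wanted <= s["hash_max"]]

      print(files_to_scan(file_stats, 0x5A5A5A5A))   # -> ['part-0002']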

    Does it save or waste any space? No. Including or excluding a column only affects how rows are grouped into files; it does not affect how much space is occupied.

    So does including a column come for free? Not quite: adding a column does impact the effectiveness of pruning on the other columns. Let's take an example: a table with columns X and Y, each with a uniform distribution of the values 0, 1, 2, 3.

    Let's say your table is 1 GB in size and the target Parquet file size is 256 MB. Liquid clustering would organize the table into 4 files, but look at how different the layout is depending on whether the clustering key is X or X+Y:

      Y 0   1   2   3           Y 0   1   2   3
    X +---------------+       X +-------+-------+
    0 |    file 1     |       0 |       |       |
      +---------------+         | file 1| file 2|
    1 |    file 2     |       1 |       |       |
      +---------------+         +-------+-------+
    2 |    file 3     |       2 |       |       |
      +---------------+         | file 3| file 4|
    3 |    file 4     |       3 |       |       |
      +---------------+         +-------+-------+
    

    In the first case (clustering by X), a filter on X (say X = 0) prunes down to exactly 1 of the 4 files, but a filter on Y alone prunes nothing: every file contains the full range of Y values.

    In the second case (clustering by X+Y), a filter on X prunes down to 2 of the 4 files, and a filter on Y also prunes down to 2 of the 4 files: each column prunes somewhat, but neither as sharply as X did in the first case.

    Either option can be better; it depends on how the table is queried.
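
    Here is a small Python simulation of the 4x4 example (the file layouts are taken directly from the diagrams above), counting how many files survive min/max pruning for a point filter on X alone versus on Y alone:

      # Each "file" is the set of (x, y) cells it contains on the 4x4 grid.
      by_x = [{(x, y) for y in range(4)} for x in range(4)]    # key: X
      by_xy = [                                                # key: X+Y
          {(x, y) for x in (0, 1) for y in (0, 1)},
          {(x, y) for x in (0, 1) for y in (2, 3)},
          {(x, y) for x in (2, 3) for y in (0, 1)},
          {(x, y) for x in (2, 3) for y in (2, 3)},
      ]

      def files_scanned(files, dim, value):
          """Count files whose min/max range on a dimension admits the value."""
          hits = 0
          for cells in files:
              vals = [c[dim] for c in cells]
              if min(vals) <= value <= max(vals):
                  hits += 1
          return hits

      for name, layout in (("X", by_x), ("X+Y", by_xy)):
          print(f"cluster by {name}: "
                f"x=0 scans {files_scanned(layout, 0, 0)}/4 files, "
                f"y=0 scans {files_scanned(layout, 1, 0)}/4 files")
      # cluster by X:   x=0 scans 1/4 files, y=0 scans 4/4 files
      # cluster by X+Y: x=0 scans 2/4 files, y=0 scans 2/4 files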

    It is similar with more columns, more distinct values, and bigger volumes: including an extra column just makes the key ranges of the other columns within each data file a bit broader.
