monetdb

In MonetDB internals, what goes in the different files named: X.thashb, X.thashl, X.tail?


While inspecting the underlying files corresponding to a particular table's column in MonetDB, I was surprised to find that the disk storage used for this column was much more than I expected. The column in question was of type int, so I would have expected rows * 4 bytes for the data itself, and then perhaps another rows * 4 bytes for the index of that column. For example, with 1,000,000,000 rows, I would have expected the disk usage to be around 8 GB.

Instead, I found that the disk usage was roughly 271% more than this baseline, spread across three files: 51.thashb, 51.thashl and 51.tail.

To better understand MonetDB's internals and perhaps reduce disk usage, I'd like to know what the difference is between these three files. Only the size of 51.thashl seems directly related to the number of rows (the .thashl file appears to be sized at 8 bytes per int row, so I'd assume it contains an index of sorts).

I'd like to know more about the files. The code on GitHub doesn't provide much insight into this.

Thanks in advance for any insights, and to any maintainers who respond, thanks so much for your ongoing work on this great database. It has made my analytic jobs a lot easier.


Solution

  • The .tail file contains the actual data of the column. The .thashb and .thashl files together contain the hash table and are optional (that is, they can be deleted with no ill effect other than a loss of efficiency).

    The .thashb file contains the hash buckets. Its size is correlated with the number of distinct values in the column. The .thashl file contains the links for the collision lists and has the same number of entries as the data has rows. The width of each entry depends on the number of rows, since every row number must fit in it. So if you have 100 rows, the width can be 1 byte, giving 1 * #rows bytes. If you have a billion rows, you need 4 bytes to fit the row numbers, giving 4 * #rows bytes.

    If you have a billion distinct values, the .thashb file is going to have well over a billion buckets, and each of them must be able to fit a row number, so 4 bytes per bucket.

    The reason there are two files is that each can grow independently. If data gets added, the hash is updated, so the .thashl file needs to grow. If too many distinct values are added, the .thashb file needs to grow.
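The bucket/link split described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not MonetDB's actual code: the names `width_for`, `build_hash` and `lookup`, the choice of widths 1, 2, 4 and 8 bytes, the reserved "empty" marker, and the use of Python's built-in `hash` are all assumptions made for the example.

```python
EMPTY = -1  # stand-in for a reserved "no row" marker (assumption for this sketch)


def width_for(n_rows):
    """Smallest of 1, 2, 4 or 8 bytes that can hold row numbers
    0..n_rows-1 plus one reserved value for the 'empty' marker."""
    for width in (1, 2, 4, 8):
        if n_rows < 2 ** (8 * width):
            return width
    raise ValueError("too many rows")


def build_hash(values, n_buckets):
    """Build the two arrays: one slot per bucket, one link per row."""
    buckets = [EMPTY] * n_buckets     # plays the role of .thashb
    links = [EMPTY] * len(values)     # plays the role of .thashl
    for row, value in enumerate(values):
        b = hash(value) % n_buckets
        links[row] = buckets[b]       # chain to the previous row in this bucket
        buckets[b] = row              # bucket now points at the newest row
    return buckets, links


def lookup(values, buckets, links, key):
    """Follow the collision list for key's bucket, newest row first."""
    row = buckets[hash(key) % len(buckets)]
    hits = []
    while row != EMPTY:
        if values[row] == key:        # other values can share the bucket
            hits.append(row)
        row = links[row]
    return hits


values = [7, 3, 7, 9, 3, 7]
buckets, links = build_hash(values, n_buckets=4)
rows_with_7 = lookup(values, buckets, links, 7)
```

In this sketch the link array always has exactly one entry per row, while the bucket array is sized independently, which mirrors why the two parts live in separate files that can grow separately. Multiplying `width_for(n_rows)` by the number of rows gives the rough size of the link part under these assumptions; any file headers or preallocation are ignored.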