google-cloud-bigtable bigtable

Is the column family name repeated in each row in Bigtable?


The Bigtable documentation notes that the column qualifier is stored for every row and suggests using the data itself as the qualifier. But it says nothing comparable about how much space column families take up.

So I am unsure whether Bigtable stores the column family name in each row, or whether Google has some technique to avoid that. Column families have to be created in advance through the UI console, the cbt CLI tool, the SDKs, or the APIs, whereas column qualifiers are created on the fly as data is written. So it is possible that column family storage is optimized in some way, since the families are already known to the Bigtable engine before any data is written.

Here are the references I have checked so far:

  1. Column qualifiers are repeated in each row, but neither the description nor the example mentions the column family - ref

Treat column qualifiers as data. Since you have to store a column qualifier for every column, you can save space by naming the column with a value.

  2. The key structure described in the section on unused-column space usage implies that the column family is stored in each row, since the key appears to be (column family + column qualifier + timestamp) - ref

row is essentially a collection of key/value entries, where the key is a combination of the column family, column qualifier and timestamp.

  3. In HBase, the column family is also stored in each row, and the documentation suggests keeping both it and the column qualifier short. Since HBase was created after the Bigtable paper, there is a chance that Bigtable also stores the column family in each row - ref

The column family and column qualifier names are repeated for each row. Therefore, keep the names as short as possible to reduce the amount of data that HBase stores and reads.
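To make the concern concrete, here is a minimal Python sketch of the key structure from the second reference, with a row modeled as a collection of key/value entries. The names `CellKey` and `key_bytes` are hypothetical, and the byte count assumes both names are stored verbatim in every cell plus an 8-byte timestamp, which is the HBase-style behavior quoted above and not a confirmed description of Bigtable's on-disk format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellKey:
    """Hypothetical model of a cell key: family + qualifier + timestamp."""
    family: str
    qualifier: str
    timestamp_micros: int


def key_bytes(key: CellKey) -> int:
    # Assumption: names are stored verbatim as UTF-8, plus an 8-byte timestamp.
    return len(key.family.encode("utf-8")) + len(key.qualifier.encode("utf-8")) + 8


# One row, modeled as a mapping from cell key to value.
row = {
    CellKey("metadata", "owner", 1_700_000_000_000_000): b"alice",
    CellKey("metadata", "region", 1_700_000_000_000_000): b"eu",
}

# Key overhead with the full family name versus a single-letter name.
long_total = sum(key_bytes(k) for k in row)
short_total = sum(
    key_bytes(CellKey("m", k.qualifier, k.timestamp_micros)) for k in row
)
```

Under this assumed model, the difference between `long_total` and `short_total` grows with the number of cells, which is why short names matter if the family name really is repeated.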

Currently, I plan to use single-letter column family names, such as d for default or m for metadata. But I wanted to check whether full names could be used instead, in case the column family name doesn't take up storage space in every row.


Solution

  • Bigtable creates an internal integer id for each column family, and this integer is what is stored in the underlying SSTable. So don't worry about the storage implications of a longer column family name - it's just metadata.
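As a rough illustration of that idea only, the Python sketch below maps each family name to an integer id once and writes the id into each cell key. The `FamilyRegistry` class, the 4-byte id width, and the key layout are all assumptions made for the example; Bigtable's actual internal encoding is not documented here.

```python
import struct


class FamilyRegistry:
    """Hypothetical table-level metadata: family name -> integer id."""

    def __init__(self):
        self._ids = {}

    def register(self, name: str) -> int:
        # Assign the next free id the first time a family name is seen.
        return self._ids.setdefault(name, len(self._ids))

    def encode_key(self, family: str, qualifier: str, timestamp_micros: int) -> bytes:
        # Assumed layout: 4-byte family id, raw qualifier bytes, 8-byte timestamp.
        family_id = self._ids[family]
        return (
            struct.pack(">I", family_id)
            + qualifier.encode("utf-8")
            + struct.pack(">q", timestamp_micros)
        )


registry = FamilyRegistry()
registry.register("d")
registry.register("a_much_longer_family_name")

short_key = registry.encode_key("d", "q", 1)
long_key = registry.encode_key("a_much_longer_family_name", "q", 1)
```

In this sketch `short_key` and `long_key` have the same length, because the family name lives only in the registry. The qualifier is still written in full, which matches the documentation's advice to keep qualifiers short.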