The Bigtable documentation mentions that the column qualifier is repeated in each row and suggests using the data itself as a qualifier. However, it says nothing comparable about the storage cost of column family names.
So I am unsure whether Bigtable stores the column family name in each row, or whether Google uses some technique to avoid that. Column families must be created up front through the UI console, the cbt CLI tool, or the SDKs and APIs, whereas column qualifiers are created on the fly as data is written. So it is possible that column family storage is optimized, since the families are already known to the Bigtable engine before any data is written.
Here are the references I have checked so far:
Treat column qualifiers as data. Since you have to store a column qualifier for every column, you can save space by naming the column with a value.
The column family is stored in each row, as the key structure seems to be (column family + column qualifier + timestamp). Ref: a row is essentially a collection of key/value entries, where the key is a combination of the column family, column qualifier and timestamp.
The column family and column qualifier names are repeated for each row. Therefore, keep the names as short as possible to reduce the amount of data that HBase stores and reads.
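To make the concern concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which every cell key literally contains the family name, the qualifier name, and an 8-byte timestamp. The function name and the model are illustrative only and do not reflect the actual on-disk format of HBase or Bigtable.

```python
def key_overhead_bytes(family: str, qualifier: str, cells: int,
                       timestamp_bytes: int = 8) -> int:
    """Bytes spent on cell keys if both names are repeated verbatim in every cell.

    Simplified, hypothetical model: key = family + qualifier + timestamp.
    """
    per_cell = (len(family.encode("utf-8"))
                + len(qualifier.encode("utf-8"))
                + timestamp_bytes)
    return per_cell * cells


# Compare a one-letter family name with a descriptive one over a million cells.
short_total = key_overhead_bytes("m", "ts", cells=1_000_000)
long_total = key_overhead_bytes("metadata", "ts", cells=1_000_000)
print(short_total, long_total, long_total - short_total)
```

Under this model the longer family name costs 7 extra bytes per cell, which is the overhead the question is asking about.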
Currently, I am planning to use single-letter column family names, such as d for default or m for metadata. But I wanted to check whether the full names could be used instead, in case the column family name doesn't take up storage space in every row.
Bigtable creates an internal integer id for each column family, and this integer is what is stored in the underlying SSTable. So don't worry about the storage implications of a longer column family name - it's just metadata.
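The idea in the answer above can be illustrated with a toy sketch: family names are registered once and mapped to small integer ids, and only the id is written into each cell key. Everything here is an assumption made for illustration (the `FamilyRegistry` class, the `encode_cell_key` helper, and the field widths); it is not Bigtable's actual SSTable encoding.

```python
import struct


class FamilyRegistry:
    """Toy table-level metadata: maps each column family name to an integer id."""

    def __init__(self):
        self._ids = {}

    def id_for(self, name: str) -> int:
        # Assign the next free id the first time a name is seen.
        return self._ids.setdefault(name, len(self._ids))


def encode_cell_key(registry: FamilyRegistry, family: str,
                    qualifier: str, timestamp_micros: int) -> bytes:
    """Encode a cell key using the family's integer id instead of its name.

    Hypothetical layout: 2-byte family id, 2-byte qualifier length,
    qualifier bytes, 8-byte timestamp.
    """
    family_id = registry.id_for(family)
    q = qualifier.encode("utf-8")
    return (struct.pack(">HH", family_id, len(q))
            + q
            + struct.pack(">q", timestamp_micros))


registry = FamilyRegistry()
short_key = encode_cell_key(registry, "m", "ts", 1_700_000_000_000_000)
long_key = encode_cell_key(registry, "metadata", "ts", 1_700_000_000_000_000)
print(len(short_key), len(long_key))
```

In this sketch both keys have the same length, because the family name lives only in the registry. The qualifier is still written in full, which matches the documentation's advice to keep qualifiers short or treat them as data.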