When selecting a compression encoding for a VARCHAR column in Redshift, two options that present themselves when the column contains a small set of potential string values are text255 and bytedict. Looking at the descriptions of these in the AWS documentation, they appear to do the same thing, with the exception that bytedict can hold 20 more values than text255. So, what advantage does text255 offer over bytedict?
To provide an example, suppose I have a column containing the regions of Japan: ["hokkaido", "tohoku", "kanto", "hokuriku", "chubu", "kansai", "chugoku", "shikoku", "kyushu", "okinawa"]. These values would fit into a VARCHAR(8) column. If I were to use a text255 encoding, each row in the column would consume 1 byte, with an additional 80 bytes for the dictionary entries. From my understanding, bytedict would offer exactly the same storage and would behave similarly for queries. So, what's the difference? I notice that the text255 description uses the term "words" rather than "column values", so does it compress word by word? If so, I can't imagine how that would work for querying.
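To make the arithmetic concrete, here is a toy sketch in Python (an illustration only, not Redshift's actual on-disk format) of a one-byte dictionary over the regions column, assuming each dictionary entry occupies a full 8-byte VARCHAR(8) slot:

```python
# Toy one-byte dictionary encoding (illustration only, not Redshift's
# actual storage format): each distinct value maps to a 1-byte index.
regions = ["hokkaido", "tohoku", "kanto", "hokuriku", "chubu",
           "kansai", "chugoku", "shikoku", "kyushu", "okinawa"]

# A sample column of 1000 rows drawn from the 10 regions.
column = [regions[i % len(regions)] for i in range(1000)]

dictionary = sorted(set(column))                   # 10 entries, well under 256
index = {value: i for i, value in enumerate(dictionary)}
encoded = bytes(index[value] for value in column)  # 1 byte per row

dict_bytes = 8 * len(dictionary)                   # 10 entries x 8-byte slots = 80
print(len(encoded) + dict_bytes)                   # 1080: 1000 rows + 80 dictionary bytes
```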
Bytedict works on any data type, as you noted. Bytedict also operates on the entire value of the column, while text255 operates on the individual words within a text value.
Using your example: if the column in question were "addresses in Japan" and the region name was part of the text in that column, then text255 could replace each region name with a 1-byte index, each of these multi-byte words in the column being replaced by a single byte. Bytedict can only index whole values, and since such a column likely contains a very large number of unique values, it would do little to compress it.
Text255 is better for text string values with lots of repeated words in them.
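That difference can be sketched in a few lines of Python (a toy model of the two dictionaries, not Redshift's actual implementation; the address strings are made up for illustration):

```python
from collections import Counter

# Hypothetical "addresses in Japan" column: every value is unique.
addresses = [f"{n} sakura street, kanto" for n in range(1000)]

# bytedict-style: dictionary over whole values. No value repeats,
# so a value-level dictionary offers no compression here.
print(len(set(addresses)))   # 1000 distinct values in 1000 rows

# text255-style: dictionary over words. "sakura", "street," and
# "kanto" repeat in every row, so 3 of the 4 words per value can
# shrink to a 1-byte dictionary index.
words = [w for a in addresses for w in a.split()]
counts = Counter(words)
repeated = sum(c for c in counts.values() if c > 1)
print(repeated, len(words))  # 3000 of 4000 word slots are repeated words
```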
It is usually a good idea, though, to run ANALYZE COMPRESSION on representative data (when available) to see which encoding is best among all the possibilities.