[SOLVED] Custom string type with a limited alphabet


In my PostgreSQL database I have a column that contains sequences of characters. The characters in these sequences represent amino acids. There are only 20 amino acids, plus some extra characters needed for special purposes.

Currently these are stored with type 'character varying'. I assume that this is inefficient because one byte is used per character, whereas in theory my alphabet could be represented in 5 bits (2 ** 5 = 32). By inefficient I mean that it takes more memory than necessary, and that comparison operations (such as checking whether one string equals or contains another) require more work than they would if there were fewer bits to check.

Is this correct? Is there some more efficient way I could store this data to minimise the size of the database and to make string operations more efficient?

Solution

Don't do that. The savings on storage are marginal, while the cons are substantial:

You incur the higher development and maintenance cost of encoding to, and decoding from, the IUPAC amino acid codes.
You lose the ability to search for sequences using powerful regular expressions, for example: SELECT * FROM proteins WHERE sequence ~ '^Y.{2,3}[RK]L';
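
To make the trade-off concrete, here is a minimal Python sketch of what 5-bit packing would involve. The alphabet string and the pack/unpack helpers are hypothetical, made up for illustration (the question does not say which extra characters are used), and this is not an implementation of a PostgreSQL type. Packing n characters takes ceil(5n/8) bytes instead of roughly n, so the saving is at most 37.5%, and the packed value has to be decoded again before a regular expression can be applied to it.

```python
import re

# Assumed alphabet: the 20 standard amino acids (IUPAC one-letter codes)
# plus a few extra symbols. The extras are a guess; at most 32 symbols fit.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYBZX*-"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack(seq: str) -> bytes:
    """Pack a sequence into 5 bits per character, padded to whole bytes."""
    bits = 0
    for ch in seq:
        bits = (bits << 5) | CODE[ch]
    nbytes = (5 * len(seq) + 7) // 8      # ceil(5n / 8)
    pad = nbytes * 8 - 5 * len(seq)       # unused bits at the end
    return (bits << pad).to_bytes(nbytes, "big")


def unpack(data: bytes, length: int) -> str:
    """Decode a packed sequence. The original length must be stored
    separately, because the padding bits are otherwise ambiguous."""
    bits = int.from_bytes(data, "big") >> (len(data) * 8 - 5 * length)
    chars = []
    for i in range(length):
        shift = 5 * (length - 1 - i)
        chars.append(ALPHABET[(bits >> shift) & 0b11111])
    return "".join(chars)


if __name__ == "__main__":
    seq = "YAARLMK"
    packed = pack(seq)
    print(len(seq), len(packed))          # 7 characters fit in 5 bytes
    # The pattern from the answer cannot run on the packed bytes directly;
    # the value has to be unpacked first.
    pattern = re.compile(r"^Y.{2,3}[RK]L")
    print(bool(pattern.search(unpack(packed, len(seq)))))
```

Every read, write and search would have to go through code like this, which is the maintenance cost the answer refers to.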