In my postgres database I have a column which contains sequences of characters. The characters in these sequences are amino acids. There are only 20 amino acids plus some extra characters needed for special purposes.
Currently these are stored with type 'character varying'. I assume that this is inefficient because one byte is used per character, whereas in theory my alphabet could be represented in 5 bits per character (2 ** 5 = 32). By inefficient I mean that it takes more memory than is necessary, and that comparison operations (such as checking whether one string equals or contains another) have more bits to check and therefore require more work.
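To make the arithmetic concrete, here is a minimal Python sketch of what 5-bit packing would look like. The alphabet ordering, the extra symbols, and the `pack`, `unpack`, and `packed_size` helpers are all assumptions made for illustration; none of this is something PostgreSQL provides.

```python
# Hypothetical alphabet: the 20 standard amino acids plus a few placeholder
# special symbols (the real extra characters are not specified here).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "BJOUXZ*-"
assert len(ALPHABET) <= 32  # every symbol must fit in 5 bits
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def packed_size(n_symbols):
    """Bytes needed for n_symbols at 5 bits each, rounded up to whole bytes."""
    return (5 * n_symbols + 7) // 8


def pack(seq):
    """Pack a sequence into bytes, 5 bits per symbol, left-aligned."""
    bits = 0
    for ch in seq:
        bits = (bits << 5) | CODE[ch]
    nbytes = packed_size(len(seq))
    bits <<= nbytes * 8 - 5 * len(seq)  # zero-pad the final byte
    return bits.to_bytes(nbytes, "big")


def unpack(data, length):
    """Inverse of pack; the symbol count must be stored separately,
    because the padding bits are indistinguishable from data."""
    bits = int.from_bytes(data, "big") >> (len(data) * 8 - 5 * length)
    return "".join(
        ALPHABET[(bits >> (5 * (length - 1 - i))) & 31] for i in range(length)
    )


seq = "MKTAYIAKQR"
print(len(seq), "characters ->", len(pack(seq)), "packed bytes")
```

Under these assumptions a 1000-residue sequence needs 625 bytes instead of 1000, so the best case is a 37.5% reduction, and the sequence length has to be kept alongside the packed bytes.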
Is this correct? Is there some more efficient way I could store this data to minimise the size of the database and to make string operations more efficient?
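Don't do that. The savings on storage are marginal, while the cons are substantial. For example, with a packed encoding you could no longer run plain pattern-matching queries like this one: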
SELECT * FROM proteins WHERE sequence ~ '^Y.{2,3}[RK]L'
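As a rough illustration of why a query like this depends on plain text storage, the sketch below uses Python's `re` module as a stand-in for PostgreSQL's `~` operator (for this particular pattern the two should behave the same, but that is an assumption, as are the sample sequences and helper names). It also shows that in a hypothetical 5-bit packing most symbols would not start on a byte boundary, so a byte-oriented matcher could not be applied to the packed form without decoding it first.

```python
import re

# The pattern from the query: starts with Y, then 2 or 3 of any residue,
# then R or K, then L.
PATTERN = re.compile(r"^Y.{2,3}[RK]L")


def matches(sequence):
    """True if the plain-text sequence matches the motif."""
    return PATTERN.search(sequence) is not None


def straddling_symbols(n_symbols, bits_per_symbol=5):
    """Indexes of symbols whose bits would span two bytes when packed."""
    result = []
    for i in range(n_symbols):
        first_bit = bits_per_symbol * i
        last_bit = first_bit + bits_per_symbol - 1
        if first_bit // 8 != last_bit // 8:
            result.append(i)
    return result


# Made-up example sequences, not real proteins.
for s in ["YAARLGG", "YAAAKLM", "MYAARL", "YARL"]:
    print(s, matches(s))

# Half of the first eight symbols cross a byte boundary when packed.
print(straddling_symbols(8))
```

With text storage the database can evaluate the pattern directly on the column; with a packed representation every row would have to be unpacked before any such comparison could run.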