machine-learninglstmrecurrent-neural-networktext-compression

How are LSTMs used for data compression?


The following link shows the best performing ML architectures for text compression of the dataset Text8.

What does the output mean (BPC)? And, how can an LSTM compress data?


Solution

  • A language model predicts the next character (or token) given the text so far. It doesn't predict a single character but a probability for each possible next character. (A categorical probability distribution.)

    If you have 256 possible characters, and the model predicts that the next character will be (50% "e", 6% "a", 1% "b", etc.) then you can compress the next character into a single bit (1) if it is actually "e", or into more than a single bit (0 followed by more bits). If the model prediction is good, you will (on average) need less than 8 bits per character (BPC).

    So it is possible to come up with a perfect encoding scheme to compress the text stream if you always know the correct probabilities for the next character. The problem of predicting the next character is a classification problem (the next character is the correct class). So if you have solved the classification problem, you have solved the compression problem (and the other way around).

    To understand this better, you have to learn about entropy, cross-entropy, and coding. For a good explanation, see MacKay's book about Information Theory, especially Chapter 4:

    https://www.inference.org.uk/mackay/itila/book.html (PDF can be downloaded)