Tags: python, numpy, information-theory

Computing (Minimum) Description Length - MDL


I have two integer arrays, x and y:

import numpy as np

x = np.array([[1, 2, 0, 12,  4],
              [5, 2, 1, 10, 12]])

y = np.array([[1, 2, 0, 11,  4],
              [5, 3, 0, 10, 15]])

And I would like to use x to compress/compute the description length (in units of "bits") of y and then compare the number of "bits saved" as a result of the compression. Given that our data is small, we'll simply use n_bits = 8 (8 bits) to store each integer. In the uncompressed case, it takes a total of 2 * 5 * 8 = 80 bits to store y (i.e., DL(y) = 80). Similarly, DL(x) = 80. Now, let's assume that x is the best possible "model"/"hypothesis" for compressing y; then, according to the MDL framework:

DL(y, x) = DL(y|x) + DL(x)

where DL(x) is the number of bits needed to store x and DL(y|x) is the number of bits needed to store the residual of y given x:

residual = y - x

array([[ 0,  0,  0, -1,  0],
       [ 0,  1, -1,  0,  3]])

So, what is DL(y|x) for this residual array? According to some examples that I've come across (that I don't fully understand), DL(y|x) can be computed by first identifying the number of unique values in the residual

n_bits = 8
n_unique = len(np.unique(residual))  # 4
DL_residual = 2 * 5 * np.log2(n_unique) + n_unique * n_bits  # 52 bits

If I understand correctly, since n_unique = 4 (i.e., the cardinality of the residual is 4), it looks like 2 * 5 * np.log2(n_unique) is accounting for the number of bits needed to store the residual. However, I have no idea why n_unique * n_bits is needed (maybe it is not??). Naively, I would've assumed that 2 * 5 * np.log2(n_unique) was sufficient.

I don't even know if this is the right way to compute the description length for the residual and, ultimately, I need to figure out what the description length of the residual is.
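For concreteness, the 80-bit uncompressed baseline quoted above follows directly from the array size; a minimal sketch:

```python
import numpy as np

n_bits = 8  # bits to store one integer

y = np.array([[1, 2, 0, 11,  4],
              [5, 3, 0, 10, 15]])

# Uncompressed description length: one n_bits slot per element
DL_y_uncompressed = y.size * n_bits  # 2 * 5 * 8 = 80
print(DL_y_uncompressed)  # 80
```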


Solution

  • TL;DR: You need 2 * 5 * np.log2(n_unique) bits to store which unique value is placed where, but then you also need n_unique * n_bits bits to store the unique values themselves.


    The transformation you are applying in order to compress y using x appears to be the following:

    1. Use x as the best possible model for y and calculate the residual. If x were a perfect model, you would expect to see all zeros in residual. However, since it is not perfect, a few other values remain. You already obtained the following residual:
    array([[ 0,  0,  0, -1,  0],
           [ 0,  1, -1,  0,  3]])
    
    2. Many of the values are 0 and a few others are identical. Therefore, to compress residual, we determine the unique integer values and replace each value by its index in a data structure that stores the unique values. I will use a list here, in particular the following:
    [0, -1, 1, 3]
    

    When I replace the values with their indices in this list, the residual becomes:

    array([[ 0,  0,  0,  1,  0],
           [ 0,  2,  1,  0,  3]])
    

    Since the indices are even smaller than the values, we need fewer bits to store them. We only need 2 * 5 * log2(n_unique) bits to store this transformed array. However, if we only store this, we are lacking the actual values that need to be inserted to reconstruct y!

    3. Because of this, we also need to store the list containing the unique values. Its elements are integers with the usual number of bits n_bits. We have n_unique = 4 of them, so to store unique we need n_unique * n_bits bits in addition to the bits for the indices.
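    The whole scheme can be sketched end to end with np.unique. Note that np.unique returns the values sorted ([-1, 0, 1, 3] rather than a hand-ordered list), so the index array may differ from one built by hand, but the bit count is identical:

```python
import numpy as np

n_bits = 8

x = np.array([[1, 2, 0, 12,  4],
              [5, 2, 1, 10, 12]])
y = np.array([[1, 2, 0, 11,  4],
              [5, 3, 0, 10, 15]])

residual = y - x

# Value table and index array in one call; np.unique sorts the values
unique_vals, indices = np.unique(residual, return_inverse=True)
indices = indices.reshape(residual.shape)

n_unique = len(unique_vals)                       # 4
bits_indices = residual.size * np.log2(n_unique)  # 10 * log2(4) = 20 bits
bits_table = n_unique * n_bits                    # 4 * 8  = 32 bits

DL_residual = bits_indices + bits_table           # 52 bits
print(DL_residual)  # 52.0
```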

    If x perfectly predicted y, then residual would be all zeros. In this case there would be only one unique value (0). You would then only have to store the size of the array, or, if you didn't want to store this information, a string of 0's encoded as a single bit. You would also have to store the value 0 itself, of course.

    In fact, you would obtain the same compressed size if the residual consisted entirely of some other single value.
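    Plugging a hypothetical all-zero residual into the same formula confirms this: the index part vanishes because log2(1) = 0, leaving only the cost of the single stored value:

```python
import numpy as np

n_bits = 8

# Hypothetical perfect model: the residual is all zeros
residual = np.zeros((2, 5), dtype=int)

n_unique = len(np.unique(residual))  # 1
DL_residual = residual.size * np.log2(n_unique) + n_unique * n_bits
print(DL_residual)  # 8.0 -- only the 8 bits for the single value 0
```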