huffman-codedeflatehuffman-tree

How can one correctly extract Huffman Tree from a dynamic compressed block?


My problem is similar to the one solved HERE, with few differences.

So the a Deflated block with dynamic Huffman codes is structured like this:

|Block Header|compressed block|end of block|

Then Block header's bit stream is like this:

|block type|HLIT|HDIST|HCLEN|

So i ran Pyflate and i got the following

Dynamic Huffman tree:

length codes: 21, distances codes: 24, code_lengths_length: 16

lengths:

[3, 0, 6, 6, 5, 3, 3, 2, 3, 3, 0, 0, 0, 0, 0, 0, 5, 6, 6]

lengths to spans:

[(0, 3), (1, 0), (2, 6), (3, 6), (4, 5), (5, 3), (6, 3), (7, 2), (8, 3), (9, 3), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 5), (17, 6), (18, 6), (19, -1)]

From this byte stream:

EC 5D 69 73 AB 36 14 FD DC CE F4 3F A8 E9 3E 0D 0E 60 B0 8D 5F 92 36 CD D2 7D 4F F7 E9 64 64 10 09 0D 36 2E E0 26 E9 F2 DF 7B 24 B0 01 AF C2 30 5D E5 F7 92 D8 02 1D 6D F7 5E E9 1E 5D E4 8F DF FF F8 52 FB 9A C5 49 10 4D 86 C4 E8 E8 2F 3C 7F 41 53 36 24 5F CE 26 87 C4 B4 C9 15 1B 11 53 37 2D FC 1A 76 AD A1 35 20 6F EA A6 8E DB AE E2 68 3C 24 67 61 30 A2 23 DA 71 A3 31 39 A6 D9 87 B7 13 16 FF 12 B8 AC 43 8B 8B A7 2F 3C FF E5 6C F4 13 73 D3 21 B9 0A 26 1E 49 EF 18 89 83 DB BB 94 4C E3 C8 9B B9 69 42 FC 28 26 4F D1 2C 26 A3 59 12 4C 58 92 BC F0 FC F5 5D CC A8 A7 5D 47 D3 C0 AD 95 F1 63 FC A6 B7 4C 7B FF 62 48 8E 7B B6 37 1A 0C AC 41 A7 D7 D5 A9 6E 9A 7A A7 EF E8 D4 E9 F4 BD BE FF E5 C7 D7 9F BD FF C9 CD D9 C5 C5 E5 C5 CD 3B 5F 7C FA E1 E5 27 6F 8F 1F 3B B7 51 74 1B B2 BC E6 D7 D1 90 1C A4 3E 9D DC 87 C1 44 7F FB 76 4C 83 90 5F 3A 20 C7 4B A9 79 86 F3 68 92 B2 49 AA 5D C7 74 92 F8 2C D6 2E 27 6E E4 05 93 DB 21 F9 79 16 A5 CC D3 A6 71 30 49 E9 28 64 A5 9B 9F A6 E8 F7 94 3D A6 47 77 E9 38 7C 46 DC 3B 1A 27 2C 3D 39 98 A5 BE 36 38 78 E1 79 FE EF 98 5F 43 11 E4 F8 0E 3D 73 7A 62 EA 78 4F F2 3F C7 2F 6A DA 0F 81 4F 5E 1C 27 D1 8F A7 FC D3 E9 FC CA 98 A5 94 B8 59 51 27 DD 8B 83 F7 2F F1 9B 79 B7 EC E0 B4 94 F9 F8 C5 1F D8 C4 0B FC 1F 8B 9C 15 80 09 1D 33 9E FB 97 80 3D 4C A3 38 3D 28 43 3E 04 5E 7A 87 37 1E E3 A3 AF 89 8F 87 24 98 04 69 40 43 2D 71 69 C8 F3 9E BC F0 BC 51 94 99 A4 4F 21 23 E9 D3 54 C0 8A C6 BB 49 72 20 EA FE C2 F3 2F 45 B3 34 8C A2 7B 42 5F 78 FE B9 DF A6 D4 13 7D A8 3F FB E3 85 E7 47 91 F7 C4 13 45 29 43 43 D7 5F 79 31 18 F3 2A D1 49 FA 0C 17 C6 C1 44 2B AE 89 14 1A DF 06 13 E4 C6 FB 1C 2A FB A0 3D B0 D1 7D 90 6A BC 74 2D 09 7E 65 1A F5 7E 9A 25 A9 C8 C9 CB 12 03 75 48 52 8F 17 38 8A 62 0F 23 EA 46 61 48 A7 09 1B CE DF 88 22 92 48 13 37 6B 61 32 A5 2E 1B EA D3 B4 9A 1E 2F D2 39 70 30 BE 2D 20 B3 CA DC 31 2E DB 43 3A 4B 23 FE 31 84 3C 6B 59 DA A2 21 E8 14 9E 3C 9C 44 13 51 AA A8 B7 C7 DC 28 A6 29 D4 58 A4 73 F4 29 C7 F6 82 64 1A D2 A7 E1 28 8C DC FB 72 37 4C 1F F9 3D 9D 2F 20 45 1F 27 B7 EF AC 74 A7 B8 4A 9F C6 37 A2 E2 37 0F 31 9D 6E EA 6F 7E EB 4B 23 EA DE DF C6 D1 6C E2 5D F3 0C FC DE 8D 3D BE 0E 64 B9 B5 AB 45 74 2E 1F 53 16 4F 68 78 1E D2 24 39 24 D5 CF 64 BA 92 82 BE 9E AC 24 FA 10

But i can't find the bytes corresponding to Pyflate's output except for 21 which is first 5 bits of byte EC and Pyflate did actually decompress the data successfully.

I need to do this manually myself without third party software to better understand it, So i know about RFC1951 but it didn't help on this one.

How does one correctly extract Huffman tree from compressed data?

Or if someone can provide an idea of how the block is really structure because clearly the above structure is just an overview.

My goal is to understand how much space does the Huffman data covers before the actual data from the file being compressed.


Solution

  • So i know about RFC1951 but it didn't help on this one.

    RFC 1951 completely and exactly describes the deflate format, so it is not possible that it can't help. Unless, that is, you have not spent the time and effort to read and understand it.

    Then Block header's bit stream is like this:

    |block type|HLIT|HDIST|HCLEN|

    The dynamic block header starts like that. The header continues for many more bits. For that example, the dynamic header is 495 bits. A little less than 62 bytes.

    Along with studying RFC 1951, you can use infgen to disassemble example streams and see the pieces. Here is the complete dynamic block header from your stream as shown by infgen, including the bits from the stream on the right (using the -dd option of infgen):

    ! infgen 3.2 output
    !
    dynamic         ! 10 0
    count 286 30 14     ! 1010 11101 11101
    code 16 4       ! 100
    code 17 6       ! 110
    code 18 6       ! 110
    code 0 4        ! 100
    code 8 3        ! 011
    code 7 3        ! 011
    code 9 5        ! 101
    code 6 2        ! 010
    code 10 3       ! 011
    code 5 3        ! 011
    code 4 5        ! 101 000
    code 3 5        ! 101 000
    zeros 9         ! 110 011111
    lens 10         ! 101
    lens 10         ! 101
    lens 0          ! 0011
    lens 0          ! 0011
    lens 10         ! 101
    zeros 18        ! 0000111 111111
    lens 5          ! 010
    lens 10         ! 101
    lens 8          ! 001
    lens 10         ! 101
    repeat 6        ! 11 1011
    lens 0          ! 0011
    lens 10         ! 101
    lens 8          ! 001
    lens 6          ! 00
    lens 7          ! 110
    lens 8          ! 001
    lens 6          ! 00
    lens 6          ! 00
    lens 6          ! 00
    lens 7          ! 110
    lens 6          ! 00
    lens 6          ! 00
    lens 7          ! 110
    lens 7          ! 110
    lens 7          ! 110
    lens 6          ! 00
    lens 7          ! 110
    lens 9          ! 01111
    lens 8          ! 001
    lens 8          ! 001
    lens 8          ! 001
    lens 10         ! 101
    lens 10         ! 101
    lens 8          ! 001
    lens 10         ! 101
    lens 8          ! 001
    repeat 3        ! 00 1011
    lens 10         ! 101
    repeat 6        ! 11 1011
    repeat 6        ! 11 1011
    lens 8          ! 001
    lens 10         ! 101
    repeat 6        ! 11 1011
    lens 0          ! 0011
    lens 10         ! 101
    lens 0          ! 0011
    lens 8          ! 001
    lens 0          ! 0011
    lens 5          ! 010
    lens 7          ! 110
    lens 6          ! 00
    lens 6          ! 00
    lens 5          ! 010
    lens 6          ! 00
    lens 8          ! 001
    lens 8          ! 001
    lens 6          ! 00
    lens 10         ! 101
    lens 8          ! 001
    lens 6          ! 00
    lens 7          ! 110
    lens 7          ! 110
    lens 6          ! 00
    lens 7          ! 110
    lens 10         ! 101
    lens 6          ! 00
    lens 6          ! 00
    lens 6          ! 00
    lens 7          ! 110
    lens 10         ! 101
    lens 8          ! 001
    lens 8          ! 001
    lens 8          ! 001
    lens 10         ! 101
    repeat 3        ! 00 1011
    zeros 130       ! 1110111 111111
    lens 10         ! 101
    lens 3          ! 00111
    lens 5          ! 010
    lens 5          ! 010
    lens 6          ! 00
    lens 6          ! 00
    lens 7          ! 110
    lens 7          ! 110
    lens 6          ! 00
    lens 6          ! 00
    lens 6          ! 00
    lens 7          ! 110
    repeat 5        ! 10 1011
    lens 5          ! 010
    lens 6          ! 00
    lens 7          ! 110
    lens 6          ! 00
    lens 6          ! 00
    lens 0          ! 0011
    lens 10         ! 101
    repeat 5        ! 10 1011
    lens 5          ! 010
    lens 7          ! 110
    lens 9          ! 01111
    lens 9          ! 01111
    lens 8          ! 001
    lens 8          ! 001
    lens 8          ! 001
    lens 7          ! 110
    lens 7          ! 110
    lens 5          ! 010
    lens 6          ! 00
    lens 6          ! 00
    lens 5          ! 010
    lens 3          ! 00111
    lens 5          ! 010
    repeat 4        ! 01 1011
    lens 4          ! 10111
    lens 4          ! 10111
    lens 4          ! 10111
    lens 5          ! 010
    lens 5          ! 010
    lens 4          ! 10111
    lens 3          ! 00111
    lens 5          ! 010
    lens 4          ! 10111
    lens 6          ! 00
    lens 5          ! 010
    lens 7          ! 110
    

    Note that the bits are read from right to left. So the bits 10 0 from the first line and 11101 from the end of the second line combine to 11101100 which is your first byte EC.

    Following the header are infgen comments (which start with an exclamation mark) showing the literal/length and distance codes that result from that dynamic block header:

    ! litlen 9 10
    ! litlen 10 10
    ! litlen 13 10
    ! litlen 32 5
    ! litlen 33 10
    ! litlen 34 8
    ! litlen 35 10
    ! litlen 36 10
    ! litlen 37 10
    ! litlen 38 10
    ! litlen 39 10
    ! litlen 40 10
    ! litlen 41 10
    ! litlen 43 10
    ! litlen 44 8
    ! litlen 45 6
    ! litlen 46 7
    ! litlen 47 8
    ! litlen 48 6
    ! litlen 49 6
    ! litlen 50 6
    ! litlen 51 7
    ! litlen 52 6
    ! litlen 53 6
    ! litlen 54 7
    ! litlen 55 7
    ! litlen 56 7
    ! litlen 57 6
    ! litlen 58 7
    ! litlen 59 9
    ! litlen 60 8
    ! litlen 61 8
    ! litlen 62 8
    ! litlen 63 10
    ! litlen 64 10
    ! litlen 65 8
    ! litlen 66 10
    ! litlen 67 8
    ! litlen 68 8
    ! litlen 69 8
    ! litlen 70 8
    ! litlen 71 10
    ! litlen 72 10
    ! litlen 73 10
    ! litlen 74 10
    ! litlen 75 10
    ! litlen 76 10
    ! litlen 77 10
    ! litlen 78 10
    ! litlen 79 10
    ! litlen 80 10
    ! litlen 81 10
    ! litlen 82 10
    ! litlen 83 10
    ! litlen 84 8
    ! litlen 85 10
    ! litlen 86 10
    ! litlen 87 10
    ! litlen 88 10
    ! litlen 89 10
    ! litlen 90 10
    ! litlen 91 10
    ! litlen 93 10
    ! litlen 95 8
    ! litlen 97 5
    ! litlen 98 7
    ! litlen 99 6
    ! litlen 100 6
    ! litlen 101 5
    ! litlen 102 6
    ! litlen 103 8
    ! litlen 104 8
    ! litlen 105 6
    ! litlen 106 10
    ! litlen 107 8
    ! litlen 108 6
    ! litlen 109 7
    ! litlen 110 7
    ! litlen 111 6
    ! litlen 112 7
    ! litlen 113 10
    ! litlen 114 6
    ! litlen 115 6
    ! litlen 116 6
    ! litlen 117 7
    ! litlen 118 10
    ! litlen 119 8
    ! litlen 120 8
    ! litlen 121 8
    ! litlen 122 10
    ! litlen 123 10
    ! litlen 124 10
    ! litlen 125 10
    ! litlen 256 10
    ! litlen 257 3
    ! litlen 258 5
    ! litlen 259 5
    ! litlen 260 6
    ! litlen 261 6
    ! litlen 262 7
    ! litlen 263 7
    ! litlen 264 6
    ! litlen 265 6
    ! litlen 266 6
    ! litlen 267 7
    ! litlen 268 7
    ! litlen 269 7
    ! litlen 270 7
    ! litlen 271 7
    ! litlen 272 7
    ! litlen 273 5
    ! litlen 274 6
    ! litlen 275 7
    ! litlen 276 6
    ! litlen 277 6
    ! litlen 279 10
    ! litlen 280 10
    ! litlen 281 10
    ! litlen 282 10
    ! litlen 283 10
    ! litlen 284 10
    ! litlen 285 5
    ! dist 0 7
    ! dist 1 9
    ! dist 2 9
    ! dist 3 8
    ! dist 4 8
    ! dist 5 8
    ! dist 6 7
    ! dist 7 7
    ! dist 8 5
    ! dist 9 6
    ! dist 10 6
    ! dist 11 5
    ! dist 12 3
    ! dist 13 5
    ! dist 14 5
    ! dist 15 5
    ! dist 16 5
    ! dist 17 5
    ! dist 18 4
    ! dist 19 4
    ! dist 20 4
    ! dist 21 5
    ! dist 22 5
    ! dist 23 4
    ! dist 24 3
    ! dist 25 5
    ! dist 26 4
    ! dist 27 6
    ! dist 28 5
    ! dist 29 7
    

    That is then followed by the actual compressed data, along with the bits for those codes:

    literal 'M      ! 1100011111
    literal 'I      ! 1111101111
    literal 'M      ! 1100011111
    literal 'E      ! 10010111
    literal '-      ! 011010
    literal 'V      ! 1101011111
    literal 'e      ! 01100
    literal 'r      ! 110001
    literal 's      ! 001001
    literal 'i      ! 000001
    literal 'o      ! 010001
    literal 'n      ! 0010011
    literal ':      ! 1000011
    ...