My problem is similar to the one solved HERE, with few differences.
So the a Deflated block with dynamic Huffman codes is structured like this:
|Block Header|compressed block|end of block|
Then Block header's bit stream is like this:
|block type|HLIT|HDIST|HCLEN|
So i ran Pyflate and i got the following
Dynamic Huffman tree:
length codes: 21, distances codes: 24, code_lengths_length: 16
lengths:
[3, 0, 6, 6, 5, 3, 3, 2, 3, 3, 0, 0, 0, 0, 0, 0, 5, 6, 6]
lengths to spans:
[(0, 3), (1, 0), (2, 6), (3, 6), (4, 5), (5, 3), (6, 3), (7, 2), (8, 3), (9, 3), (10, 0), (11, 0), (12, 0), (13, 0), (14, 0), (15, 0), (16, 5), (17, 6), (18, 6), (19, -1)]
From this byte stream:
EC 5D 69 73 AB 36 14 FD DC CE F4 3F A8 E9 3E 0D 0E 60 B0 8D 5F 92 36 CD D2 7D 4F F7 E9 64 64 10 09 0D 36 2E E0 26 E9 F2 DF 7B 24 B0 01 AF C2 30 5D E5 F7 92 D8 02 1D 6D F7 5E E9 1E 5D E4 8F DF FF F8 52 FB 9A C5 49 10 4D 86 C4 E8 E8 2F 3C 7F 41 53 36 24 5F CE 26 87 C4 B4 C9 15 1B 11 53 37 2D FC 1A 76 AD A1 35 20 6F EA A6 8E DB AE E2 68 3C 24 67 61 30 A2 23 DA 71 A3 31 39 A6 D9 87 B7 13 16 FF 12 B8 AC 43 8B 8B A7 2F 3C FF E5 6C F4 13 73 D3 21 B9 0A 26 1E 49 EF 18 89 83 DB BB 94 4C E3 C8 9B B9 69 42 FC 28 26 4F D1 2C 26 A3 59 12 4C 58 92 BC F0 FC F5 5D CC A8 A7 5D 47 D3 C0 AD 95 F1 63 FC A6 B7 4C 7B FF 62 48 8E 7B B6 37 1A 0C AC 41 A7 D7 D5 A9 6E 9A 7A A7 EF E8 D4 E9 F4 BD BE FF E5 C7 D7 9F BD FF C9 CD D9 C5 C5 E5 C5 CD 3B 5F 7C FA E1 E5 27 6F 8F 1F 3B B7 51 74 1B B2 BC E6 D7 D1 90 1C A4 3E 9D DC 87 C1 44 7F FB 76 4C 83 90 5F 3A 20 C7 4B A9 79 86 F3 68 92 B2 49 AA 5D C7 74 92 F8 2C D6 2E 27 6E E4 05 93 DB 21 F9 79 16 A5 CC D3 A6 71 30 49 E9 28 64 A5 9B 9F A6 E8 F7 94 3D A6 47 77 E9 38 7C 46 DC 3B 1A 27 2C 3D 39 98 A5 BE 36 38 78 E1 79 FE EF 98 5F 43 11 E4 F8 0E 3D 73 7A 62 EA 78 4F F2 3F C7 2F 6A DA 0F 81 4F 5E 1C 27 D1 8F A7 FC D3 E9 FC CA 98 A5 94 B8 59 51 27 DD 8B 83 F7 2F F1 9B 79 B7 EC E0 B4 94 F9 F8 C5 1F D8 C4 0B FC 1F 8B 9C 15 80 09 1D 33 9E FB 97 80 3D 4C A3 38 3D 28 43 3E 04 5E 7A 87 37 1E E3 A3 AF 89 8F 87 24 98 04 69 40 43 2D 71 69 C8 F3 9E BC F0 BC 51 94 99 A4 4F 21 23 E9 D3 54 C0 8A C6 BB 49 72 20 EA FE C2 F3 2F 45 B3 34 8C A2 7B 42 5F 78 FE B9 DF A6 D4 13 7D A8 3F FB E3 85 E7 47 91 F7 C4 13 45 29 43 43 D7 5F 79 31 18 F3 2A D1 49 FA 0C 17 C6 C1 44 2B AE 89 14 1A DF 06 13 E4 C6 FB 1C 2A FB A0 3D B0 D1 7D 90 6A BC 74 2D 09 7E 65 1A F5 7E 9A 25 A9 C8 C9 CB 12 03 75 48 52 8F 17 38 8A 62 0F 23 EA 46 61 48 A7 09 1B CE DF 88 22 92 48 13 37 6B 61 32 A5 2E 1B EA D3 B4 9A 1E 2F D2 39 70 30 BE 2D 20 B3 CA DC 31 2E DB 43 3A 4B 23 FE 31 84 3C 6B 59 DA A2 21 E8 14 9E 3C 9C 44 13 51 AA A8 B7 C7 DC 28 A6 29 D4 58 A4 73 F4 29 C7 F6 82 64 1A D2 A7 E1 28 8C DC FB 72 37 4C 1F F9 3D 9D 2F 20 45 1F 27 B7 EF AC 74 A7 B8 4A 9F C6 37 A2 E2 37 0F 31 9D 6E EA 6F 7E EB 4B 23 EA DE DF C6 D1 6C E2 5D F3 0C FC DE 8D 3D BE 0E 64 B9 B5 AB 45 74 2E 1F 53 16 4F 68 78 1E D2 24 39 24 D5 CF 64 BA 92 82 BE 9E AC 24 FA 10
But i can't find the bytes corresponding to Pyflate's output except for 21
which is first 5 bits of byte EC
and Pyflate did actually decompress the data successfully.
I need to do this manually myself without third party software to better understand it, So i know about RFC1951 but it didn't help on this one.
How does one correctly extract Huffman tree from compressed data?
Or if someone can provide an idea of how the block is really structure because clearly the above structure is just an overview.
My goal is to understand how much space does the Huffman data covers before the actual data from the file being compressed.
So i know about RFC1951 but it didn't help on this one.
RFC 1951 completely and exactly describes the deflate format, so it is not possible that it can't help. Unless, that is, you have not spent the time and effort to read and understand it.
Then Block header's bit stream is like this:
|block type|HLIT|HDIST|HCLEN|
The dynamic block header starts like that. The header continues for many more bits. For that example, the dynamic header is 495 bits. A little less than 62 bytes.
Along with studying RFC 1951, you can use infgen to disassemble example streams and see the pieces. Here is the complete dynamic block header from your stream as shown by infgen, including the bits from the stream on the right (using the -dd
option of infgen):
! infgen 3.2 output
!
dynamic ! 10 0
count 286 30 14 ! 1010 11101 11101
code 16 4 ! 100
code 17 6 ! 110
code 18 6 ! 110
code 0 4 ! 100
code 8 3 ! 011
code 7 3 ! 011
code 9 5 ! 101
code 6 2 ! 010
code 10 3 ! 011
code 5 3 ! 011
code 4 5 ! 101 000
code 3 5 ! 101 000
zeros 9 ! 110 011111
lens 10 ! 101
lens 10 ! 101
lens 0 ! 0011
lens 0 ! 0011
lens 10 ! 101
zeros 18 ! 0000111 111111
lens 5 ! 010
lens 10 ! 101
lens 8 ! 001
lens 10 ! 101
repeat 6 ! 11 1011
lens 0 ! 0011
lens 10 ! 101
lens 8 ! 001
lens 6 ! 00
lens 7 ! 110
lens 8 ! 001
lens 6 ! 00
lens 6 ! 00
lens 6 ! 00
lens 7 ! 110
lens 6 ! 00
lens 6 ! 00
lens 7 ! 110
lens 7 ! 110
lens 7 ! 110
lens 6 ! 00
lens 7 ! 110
lens 9 ! 01111
lens 8 ! 001
lens 8 ! 001
lens 8 ! 001
lens 10 ! 101
lens 10 ! 101
lens 8 ! 001
lens 10 ! 101
lens 8 ! 001
repeat 3 ! 00 1011
lens 10 ! 101
repeat 6 ! 11 1011
repeat 6 ! 11 1011
lens 8 ! 001
lens 10 ! 101
repeat 6 ! 11 1011
lens 0 ! 0011
lens 10 ! 101
lens 0 ! 0011
lens 8 ! 001
lens 0 ! 0011
lens 5 ! 010
lens 7 ! 110
lens 6 ! 00
lens 6 ! 00
lens 5 ! 010
lens 6 ! 00
lens 8 ! 001
lens 8 ! 001
lens 6 ! 00
lens 10 ! 101
lens 8 ! 001
lens 6 ! 00
lens 7 ! 110
lens 7 ! 110
lens 6 ! 00
lens 7 ! 110
lens 10 ! 101
lens 6 ! 00
lens 6 ! 00
lens 6 ! 00
lens 7 ! 110
lens 10 ! 101
lens 8 ! 001
lens 8 ! 001
lens 8 ! 001
lens 10 ! 101
repeat 3 ! 00 1011
zeros 130 ! 1110111 111111
lens 10 ! 101
lens 3 ! 00111
lens 5 ! 010
lens 5 ! 010
lens 6 ! 00
lens 6 ! 00
lens 7 ! 110
lens 7 ! 110
lens 6 ! 00
lens 6 ! 00
lens 6 ! 00
lens 7 ! 110
repeat 5 ! 10 1011
lens 5 ! 010
lens 6 ! 00
lens 7 ! 110
lens 6 ! 00
lens 6 ! 00
lens 0 ! 0011
lens 10 ! 101
repeat 5 ! 10 1011
lens 5 ! 010
lens 7 ! 110
lens 9 ! 01111
lens 9 ! 01111
lens 8 ! 001
lens 8 ! 001
lens 8 ! 001
lens 7 ! 110
lens 7 ! 110
lens 5 ! 010
lens 6 ! 00
lens 6 ! 00
lens 5 ! 010
lens 3 ! 00111
lens 5 ! 010
repeat 4 ! 01 1011
lens 4 ! 10111
lens 4 ! 10111
lens 4 ! 10111
lens 5 ! 010
lens 5 ! 010
lens 4 ! 10111
lens 3 ! 00111
lens 5 ! 010
lens 4 ! 10111
lens 6 ! 00
lens 5 ! 010
lens 7 ! 110
Note that the bits are read from right to left. So the bits 10 0
from the first line and 11101
from the end of the second line combine to 11101100
which is your first byte EC
.
Following the header are infgen comments (which start with an exclamation mark) showing the literal/length and distance codes that result from that dynamic block header:
! litlen 9 10
! litlen 10 10
! litlen 13 10
! litlen 32 5
! litlen 33 10
! litlen 34 8
! litlen 35 10
! litlen 36 10
! litlen 37 10
! litlen 38 10
! litlen 39 10
! litlen 40 10
! litlen 41 10
! litlen 43 10
! litlen 44 8
! litlen 45 6
! litlen 46 7
! litlen 47 8
! litlen 48 6
! litlen 49 6
! litlen 50 6
! litlen 51 7
! litlen 52 6
! litlen 53 6
! litlen 54 7
! litlen 55 7
! litlen 56 7
! litlen 57 6
! litlen 58 7
! litlen 59 9
! litlen 60 8
! litlen 61 8
! litlen 62 8
! litlen 63 10
! litlen 64 10
! litlen 65 8
! litlen 66 10
! litlen 67 8
! litlen 68 8
! litlen 69 8
! litlen 70 8
! litlen 71 10
! litlen 72 10
! litlen 73 10
! litlen 74 10
! litlen 75 10
! litlen 76 10
! litlen 77 10
! litlen 78 10
! litlen 79 10
! litlen 80 10
! litlen 81 10
! litlen 82 10
! litlen 83 10
! litlen 84 8
! litlen 85 10
! litlen 86 10
! litlen 87 10
! litlen 88 10
! litlen 89 10
! litlen 90 10
! litlen 91 10
! litlen 93 10
! litlen 95 8
! litlen 97 5
! litlen 98 7
! litlen 99 6
! litlen 100 6
! litlen 101 5
! litlen 102 6
! litlen 103 8
! litlen 104 8
! litlen 105 6
! litlen 106 10
! litlen 107 8
! litlen 108 6
! litlen 109 7
! litlen 110 7
! litlen 111 6
! litlen 112 7
! litlen 113 10
! litlen 114 6
! litlen 115 6
! litlen 116 6
! litlen 117 7
! litlen 118 10
! litlen 119 8
! litlen 120 8
! litlen 121 8
! litlen 122 10
! litlen 123 10
! litlen 124 10
! litlen 125 10
! litlen 256 10
! litlen 257 3
! litlen 258 5
! litlen 259 5
! litlen 260 6
! litlen 261 6
! litlen 262 7
! litlen 263 7
! litlen 264 6
! litlen 265 6
! litlen 266 6
! litlen 267 7
! litlen 268 7
! litlen 269 7
! litlen 270 7
! litlen 271 7
! litlen 272 7
! litlen 273 5
! litlen 274 6
! litlen 275 7
! litlen 276 6
! litlen 277 6
! litlen 279 10
! litlen 280 10
! litlen 281 10
! litlen 282 10
! litlen 283 10
! litlen 284 10
! litlen 285 5
! dist 0 7
! dist 1 9
! dist 2 9
! dist 3 8
! dist 4 8
! dist 5 8
! dist 6 7
! dist 7 7
! dist 8 5
! dist 9 6
! dist 10 6
! dist 11 5
! dist 12 3
! dist 13 5
! dist 14 5
! dist 15 5
! dist 16 5
! dist 17 5
! dist 18 4
! dist 19 4
! dist 20 4
! dist 21 5
! dist 22 5
! dist 23 4
! dist 24 3
! dist 25 5
! dist 26 4
! dist 27 6
! dist 28 5
! dist 29 7
That is then followed by the actual compressed data, along with the bits for those codes:
literal 'M ! 1100011111
literal 'I ! 1111101111
literal 'M ! 1100011111
literal 'E ! 10010111
literal '- ! 011010
literal 'V ! 1101011111
literal 'e ! 01100
literal 'r ! 110001
literal 's ! 001001
literal 'i ! 000001
literal 'o ! 010001
literal 'n ! 0010011
literal ': ! 1000011
...