How to rebuild the dynamic Huffman tree from DEFLATE
Asked Answered
M

3

5

This question is in regards to section 3.2.7 of RFC-1951, rebuilding the dynamic Huffman tree.

Each code is defined by a sequence of code lengths, such that all codes of a given bit length have lexicographically consecutive values.

For example, here is a rgb(255,0,0) 50x50 png, where the IDAT is a dynamic Huffman tree from DEFLATE.

0000024: xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx  CIDATx
000002a: xxxxxxxx 11101101 11001111 00110001 00010001 00000000  ...1..
0000030: 00000000 00001000 00000000 10100001 11101111 01011111  ....._
0000036: 01011010 00110011 10111000 01111010 00001100 00000100  Z3.z..
000003c: 10100000 10101001 11111001 00100000 00010001 00010001  ... ..
0000042: 00010001 00010001 00010001 00010001 00010001 00010001  ......
0000048: 00010001 00010001 00010001 00010001 00010001 00010001  ......
000004e: 00010001 00010001 00010001 00010001 00010001 00010001  ......
0000054: 00010001 00010001 00010001 00010001 00010001 00010001  ......
000005a: 00010001 00010001 00010001 00010001 00010001 00010001  ......
0000060: 00010001 00010001 00010001 00010001 00010001 10010001  ......
0000066: 10001011 00000101 10110000 00110011 01110101 10010110  ...3u.
000006c: 01111001 11000101 00011100 10110001 00000000 00000000  y.....
0000072: 00000000 00000000 01001001 01000101 01001110 01000100  ..

infgen produces this header:

last
dynamic
litlen 0 2
litlen 255 4
litlen 256 4
litlen 274 4
litlen 283 4
litlen 285 1
dist 3 1
dist 15 1

...the goal is to understand the bits and processes to rebuild the dynamic tree...


The first three bits describe the DEFLATE header.

101 <- last block bit, tree type is dynamic. 

The next fourteen bits describe the HLIT, HDIST, and HCLEN.

11101 <- HLIT,  29 + 257 = 286 
01111 <- HDIST, 15 + 1 = 16
1110 <- HCLEAN, 14 + 4 = 18

What do these values describe about the dynamic Huffman tree?


Next, reading three bits at a time and following the permutation table...the lengths are found to be...

Lengths: [4, 2, 4, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2] (line 697 of puff.c)

Are these lengths used to define the literals?

Marlin answered 4/11, 2018 at 19:7 Comment(0)
M
2

What do these values describe about the dynamic Huffman tree?

They don't really tell you much about the tree, but rather how many symbols for each type of code are described in the subsequent bits of the dynamic header.

There are eighteen code length code lengths provided next (three bits each), followed by 286 literal/length codes and then sixteen distance codes, all encoded using the code length code.

Are these lengths used to define the literals?

No. The three-bit lengths are for the code length codes. You need to build that code just to read the following literal/length and distance code lengths, which are themselves compressed using that code.

This is described in section 3.2.7 of RFC 1951:

"For even greater compactness, the code length sequences themselves are compressed using a Huffman code."

Markle answered 4/11, 2018 at 20:0 Comment(0)
M
5

More information...the next 14 bits are...

5 Bits of HLIT, the Literal/Length codes - 257 (257 - 286)
5 Bits of HDIST, the Distance codes - 1        (1 - 32)
4 Bits of HCLEN, the Code Length codes - 4     (4 - 19)

From the example:
  HLIT => 29 + 257 = 286, HDIST => 15 + 1 = 16, HCLEN => 14 + 4 = 18

Now gather 3-bit values HCLEN times. Following the order of the permutationt table (page 13 from RFC-1951.)

permuted ordering: [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

From the example: 
   HCLEN is 18 => 18*3 = 54-bits.
   lengths: [ 4 2 4 0 2 0 0 0 0 0 0 0 0 0 0 0 0 3 2 ]

Next unpack those triple-bit values. This unpacking provides instructions on how to construct the dynamic Huffman tree… “For even greater compactness, the code length sequences themselves are compressed using a Huffman code.”

To unpack the triple-bit values: (examples below)
  1. Count the number occurrences of the values.
  2. Determine the offset, the count plus the previous offset.
  3. Determine the symbols. 
     A symbol is placed at the offset value, which is found at a length value. 
     After placing a symbol, increment the offset.


From the example: 
    LENGTHS: [ 4 2 4 0 2 0 0 0 0 0 0 0 0 0 0 0 0 3 2 ] //the three-bits from HCLEN
    COUNT: [ 13 0 3 1 2 0 0 0 0 0 0 0 0 0 0 0 ] //the occurrences of lengths.
      i.e, 0 occurs 13 times, 1 occurs 0 times, 2 occurs 3 times...
    OFFSET: [ x 0 0 3 4 6 6 6 6 6 6 6 6 6 6 6 ] //the count plus the previous offset.
      i.e, o[2] = 0 + 3, o[3] = 3 + 1, o[4] = 4 + 0...
    SYMBOL: [ 1 4 18 17 0 2 ] //use length to index offset to place symbol
      i.e, if i=1, s[off[len[i]]] = s[off[len[2]]] => s[off[0]] => s[0] = 1, increment off[0]...

...The symbols are now defined. Next decode bits to build the dynamic Huffman tree…

Marlin answered 6/11, 2018 at 0:11 Comment(0)
M
2

What do these values describe about the dynamic Huffman tree?

They don't really tell you much about the tree, but rather how many symbols for each type of code are described in the subsequent bits of the dynamic header.

There are eighteen code length code lengths provided next (three bits each), followed by 286 literal/length codes and then sixteen distance codes, all encoded using the code length code.

Are these lengths used to define the literals?

No. The three-bit lengths are for the code length codes. You need to build that code just to read the following literal/length and distance code lengths, which are themselves compressed using that code.

This is described in section 3.2.7 of RFC 1951:

"For even greater compactness, the code length sequences themselves are compressed using a Huffman code."

Markle answered 4/11, 2018 at 20:0 Comment(0)
V
1

The header contains bit lengths of the Deflate Literal (0-255), Length (256-285), and Distance codes (0-29). If you know their bit lengths you can rebuild the tree following the algorithm in Section 3.2.2 of RFC1951. Look for the steps that start with "Count the number of codes for each code length..."

However, the bit length values are also encoded with Huffman using 0 to 7 bits. You need first crack open the HCLEN table at the beginning using the same algorithm as above. Section 3.2.7 explains it.

Why encode the header twice as such? To make the dynamic Huffman header small.

Varied answered 5/11, 2018 at 16:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.