What is the maximum size of encoded dynamic Huffman tree as used by DEFLATE (zlib, gzip) format?

Section "3.2.7. Compression with dynamic Huffman codes (BTYPE=10)" of https://www.ietf.org/rfc/rfc1951.txt describes encoding of dynamic Huffman tree used during compression. What is the maximum size (in bits) of such encoded Huffman tree representation as may appear in DEFLATE bitstream? Extra points for backing a particular figure with external reference ;-).

This is a theoretical question to understand DEFLATE properties better. But of course it has practical applications, for example: how big a buffer is needed to be guaranteed enough room to decode the Huffman tree?

Battalion answered 21/8, 2014 at 16:18 Comment(0)

An upper bound on the length of a dynamic block header can be readily computed from the reference you already provided. From RFC 1951, section 3.2.7 we can add up the bits:

3 + 5 + 5 + 4 + 19 * 3 + (286 + 30) * 7 = 2286 bits = 285.75 bytes.

(See calculation notes below for details.)
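
For reference, here is the same arithmetic as a quick Python sketch (field names follow RFC 1951 section 3.2.7; the leading 3 bits are the BFINAL and BTYPE fields of the block header):

BFINAL_BTYPE = 1 + 2            # final-block flag + block type
HLIT = 5                        # number of literal/length codes - 257
HDIST = 5                       # number of distance codes - 1
HCLEN = 4                       # number of code length codes - 4
CODE_LENGTH_CODE = 19 * 3       # up to 19 code length code lengths, 3 bits each
# Worst case: 286 literal/length + 30 distance code lengths, each sent as a
# 7-bit code length symbol, with none of the run-length codes 16/17/18 used.
CODE_LENGTHS = (286 + 30) * 7

bits = BFINAL_BTYPE + HLIT + HDIST + HCLEN + CODE_LENGTH_CODE + CODE_LENGTHS
print(bits, bits / 8)           # 2286 285.75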

In practice you will never see one anywhere near as long as 286 bytes. More typical lengths are 60 to 90 bytes.

Here is a histogram of dynamic block header lengths from a gzipped source distribution of Linux, linux-3.1.6.tar.gz:

[Figure: histogram of dynamic block header lengths in linux-3.1.6.tar.gz]

They don't all look the same. Here is another for Archive.pax.gz (an application distribution):

[Figure: histogram of dynamic block header lengths in Archive.pax.gz]

The bimodal shape is probably executables vs. text. Executables code all literal byte values, resulting in larger dynamic headers to describe codes for all of those values.


Calculation notes:

I deliberately did not add possible extra bits for symbols 16, 17, or 18, because the use of any of those codes, including their extra bits, would reduce the length of the header, not increase it. A 16 symbol would replace 21 to 42 bits with 9 bits, a 17 symbol would replace 21 to 70 bits with 10 bits, and an 18 symbol would replace 77 to 966 bits with 14 bits (where all symbols are assumed to be seven bits).
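
To see where those figures come from, here is a small sketch (the repeat counts and extra-bit widths are from RFC 1951; the 7-bit cost per code length symbol is the worst-case assumption used above):

SYMBOL_BITS = 7                 # worst-case cost of one code length symbol
runs = {
    16: (3, 6, 2),              # repeat the previous length 3..6 times, 2 extra bits
    17: (3, 10, 3),             # repeat the zero length 3..10 times, 3 extra bits
    18: (11, 138, 7),           # repeat the zero length 11..138 times, 7 extra bits
}
for sym, (lo, hi, extra) in runs.items():
    print(f"symbol {sym}: replaces {lo * SYMBOL_BITS}..{hi * SYMBOL_BITS} "
          f"bits with {SYMBOL_BITS + extra} bits")
# symbol 16: replaces 21..42 bits with 9 bits
# symbol 17: replaces 21..70 bits with 10 bits
# symbol 18: replaces 77..966 bits with 14 bits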

There are still 19 initial code lengths even if 16, 17, and 18 are not used, since those are stored first.

I limited the literal/length code lengths to 286 and the distance code lengths to 30, since a compliant inflator will reject values above that.

2286 is the lowest possible upper bound, since there is no constraint in the deflate format that the header be constructed to be optimal. It is possible to construct the code lengths code to, for example, have the lengths 4, 5, 8, and 9 all be 7-bit codes, and then use only those in the list of lengths to construct complete literal/length and distance codes. The code lengths code must also be complete, but that can be achieved by assigning shorter codes to unused lengths.
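
For example, one hypothetical assignment (mine, for illustration only, not necessarily the one used in the stream below) gives the four used lengths 4, 5, 8, and 9 seven-bit codes and fills out the code lengths code with shorter codes for unused symbols so that it stays complete:

# Code lengths code: symbol -> bit length.  Symbols 4, 5, 8, 9 (the lengths
# actually used) get 7-bit codes; symbols 0, 1, 16, 17, 18 are never emitted
# but are given shorter codes so the Kraft sum reaches exactly 1.
code_lengths_code = {4: 7, 5: 7, 8: 7, 9: 7,
                     0: 1, 1: 2, 16: 3, 17: 4, 18: 5}
kraft = sum(2 ** -bits for bits in code_lengths_code.values())
print(kraft)   # 1.0 -> the code is complete, so a strict inflater accepts it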

In short, it is possible to construct an entirely valid dynamic block header that is 2286 bits in length. In fact, here's one (there are many ways to do this):

ed fd 01 e0 38 70 1c 28 a7 fc 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf
df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff f9 7c bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef 23

That is a valid and complete deflate stream represented in hexadecimal. It consists of a single dynamic block, marked as the last block, with a 2286-bit dynamic header and a 9-bit end-of-block code, totaling 2295 bits, rounding up to 287 bytes. It decompresses to zero bytes with no errors.
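
If you want to check that claim yourself, a quick way is Python's zlib module (raw DEFLATE, hence wbits=-15); assuming the hex dump above is transcribed correctly, it inflates to zero bytes:

import zlib

# The 287-byte stream from the hex dump above, pasted verbatim.
HEX = """
ed fd 01 e0 38 70 1c 28 a7 fc 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf
df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb
fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e
bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df
ef f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef f7
fb fd 7e bf df ef f7 fb fd 7e bf df ef f7 fb fd
7e bf df ef f7 fb fd 7e bf df ef f7 fb fd 7e ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ff ff ff ff f9 7c bf df ef f7 fb fd 7e bf df ef
f7 fb fd 7e bf df ef f7 fb fd 7e bf df ef 23
"""
stream = bytes.fromhex("".join(HEX.split()))
data = zlib.decompress(stream, wbits=-15)   # raw DEFLATE, no zlib/gzip wrapper
print(len(stream), len(data))               # expect: 287 0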

Cupellation answered 22/8, 2014 at 19:39 Comment(6)
Thanks for coming by - I wondered if it was possible to summon you to the question somehow ;-). And sorry for the confusion in the question formulation - I was asking not about the maximum length of a single Huffman code as used by DEFLATE, but about the maximum cumulative size of the Huffman tree representation as encoded at the beginning of a dynamic Huffman block.Battalion
Thanks for the graphs! I indeed did my own calculation first, which I intended to post as an answer if nobody provided a better one. But with it, I see that your conservative estimate is a bit too optimistic ;-). Let me post my calculation, in the hope that we can find a more precise bound.Battalion
Could you share how you generated those graphs? I'm curious whether you wrote code to parse the bits yourself, or if there's some utility that can extract this kind of metadata for free.Marylyn
I post-processed the output of infgen. So yes, I wrote the code myself since I wrote infgen, and yes, infgen is freely available.Cupellation
Just to say, I think the calculation should include 1+2 bits (if any) for the block type header, because of the single bit before it to denote whether it's the last block. The three bits are pretty much inseparable.Latifundium
@Latifundium Added it. Thanks.Cupellation

The answer is 5 + 5 + 4 + 19 * 3 + (286 * 7) + (30 * 7) = 2283 bits, excluding the 3 bits for the BFINAL and BTYPE block header fields.

Zlib will reject a header that defines more than 286 literal/length codes or more than 30 distance codes (i.e. symbols 286, 287 and 30, 31 are never coded). Therefore, 286 and 30 are the proper counts in the sum, not 288 and 32.
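
As a quick consistency check against the figure in the accepted answer (the 3 extra bits being the BFINAL and BTYPE block header fields it includes):

header_only = 5 + 5 + 4 + 19 * 3 + 286 * 7 + 30 * 7
print(header_only)        # 2283
print(header_only + 3)    # 2286, matching the accepted answer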

Fantoccini answered 3/6, 2017 at 1:43 Comment(1)
Good point. This weighs in on the question of how big the buffer should be - it depends on some implementation choices you might make. If you restrict the lengths of the tables, you can slightly reduce the buffer size needed.Latifundium

I would like to elaborate on Mark's calculation above. Hope it'll be readable:

2 (block type, not a Huffman tree part per se) + 5 (HLIT from the spec) + 5 (HDIST) + 4 (HCLEN) + 19 (max number of code length code lengths) * 3 (bits each) + (288 + 32) * (7 + 7) = 4553 bits

Differences from Mark's formula:

  • I'm using 288 as the literal/length alphabet size (2 codes can't be used as symbols, which is why Mark uses 286; but then 2 codes can't be used for distances either, and yet at the time his answer used 32, not 30. I don't understand the constraints for when invalid codes should be counted and when not, so I go conservative and always count them).

  • Where Mark's formula counts at most 7 bits per code length, mine has (7 + 7). That's because some length codes (value 18 specifically) can read 7 extra bits from the stream.

However, thinking about that last point: the code which reads 7 extra bits has special semantics - it encodes a run of the zero length repeated some number (11 to 138) of times. In other words, if a 7+7-bit code was read, it replaces at least 11*7 bits' worth of literal code lengths. So my estimate of 4553 bits is naively overstated, and Mark's estimate provides a much better upper bound.

But it's still conservative - it's implausible for all the Huffman codes to have the maximum length. If the distribution is even, the Huffman code lengths should be close to the average. On the other hand, it's apparently possible to have a distribution where one symbol gets a Huffman length of 1 while the rest hover around the maximum length. So the exact upper bound should not be much lower than the conservative one given by Mark. If someone can provide that exact upper bound, it would be cool; otherwise, Mark's bound is already pretty good for the practical application of buffering.
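
To put the two figures side by side, here is a small sketch using only the numbers already quoted in this thread:

# Naive bound: count the reserved codes too (288 and 32) and charge every
# code length symbol its 7 possible extra bits on top of a 7-bit code.
naive = 2 + 5 + 5 + 4 + 19 * 3 + (288 + 32) * (7 + 7)
# Mark's tighter bound: 286 + 30 codes, no extra bits, plus BFINAL and BTYPE.
tight = 1 + 2 + 5 + 5 + 4 + 19 * 3 + (286 + 30) * 7
print(naive, tight)   # 4553 2286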

Battalion answered 23/8, 2014 at 13:46 Comment(4)
You're right, it should be 288. I will edit my answer. I got the 286 from this line in the RFC: "5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)", but that range is not correct. It should be (257 - 288) since 257 + 31 = 288.Cupellation
As you note, you should not include extra bits in the upper bound calculation, since they can only make the header shorter, not longer.Cupellation
The upper bound length is implausible, but permitted by the standard. If the Huffman code is computed dynamically based on the frequencies of the symbols, then yes, the upper bound is impossible. However there is nothing in the RFC that says you have to do that. A compliant compressor could, for example, have a constant preset Huffman code for the dynamic header, and the header information could happen to use only the long codes.Cupellation
I have added these comments to the answer, albeit four years later.Cupellation
