Why modulo 65521 in Adler-32 checksum algorithm?

The Adler-32 checksum algorithm does sums modulo 65521. I know that 65521 is the largest prime number that fits in 16 bits, but why is it important to use a prime number in this algorithm?

(I'm sure the answer will seem obvious once someone tells me, but the number-theory parts of my brain just aren't working. Even without expertise in checksum algorithms, a smart person who reads http://en.wikipedia.org/wiki/Fletcher%27s_checksum can probably explain it to me.)

Lyford answered 29/5, 2009 at 17:52 Comment(1)
See the paper linked from Wikipedia: zlib.net/maxino06_fletcher-adler.pdf It seems that using a prime number (65521) is actually not better than using, say, 65536: "while the prime modulus in the Adler checksum results in better mixing, there are fewer “bins” (i.e., valid FCS values) available for code words. In most cases, this reduction in bins outweighs the gains made by better mixing." – Reproachless

Why was mod prime used for Adler32?

From Adler's own website http://zlib.net/zlib_tech.html

However, Adler-32 has been constructed to minimize the ways to make small changes in the data that result in the same check value, through the use of sums significantly larger than the bytes and by using a prime (65521) for the modulus. It is in this area that some analysis is deserved, but it has not yet been done.

The main reason for Adler-32 is, of course, speed in software implementations.
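That speed comes from the simplicity of the inner loop: one addition and one accumulation per byte. A minimal Python sketch (reducing modulo 65521 at every step for clarity; production implementations defer the modulo over many bytes for speed):

```python
MOD_ADLER = 65521  # largest prime below 2**16

def adler32(data: bytes) -> int:
    # Two running sums: A accumulates the bytes, B accumulates the A values.
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a  # B in the high 16 bits, A in the low 16 bits
```

This matches zlib's `adler32`; for example, `adler32(b"Wikipedia")` yields 0x11E60398.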

An alternative to Adler-32 is Fletcher-32, which replaces the modulo of 65521 with 65535. This paper shows that Fletcher-32 is superior for channels with low-rate random bit errors.

It was used because primes tend to have better mixing properties. Exactly how good it is remains to be discussed.

Other Explanations

Someone else in this thread makes a somewhat convincing argument that a prime modulus is better for detecting bit-swapping. However, this is most likely not the case, because bit-swapping is extremely rare. The two most prevalent errors are:

  1. Random bit-flips (1 <-> 0), common anywhere.
  2. Bit shifts (1 2 3 4 5 -> 2 3 4 5 or 1 1 2 3 4 5), common in networking.

Most of the bit-swapping out there is caused by random bit-flips that happened to look like a bit swap.

Error-correcting codes are, in fact, designed to withstand n bits of deviation. From Adler's website:

A properly constructed CRC-n has the nice property that less than n bits in error is always detectable. This is not always true for Adler-32--it can detect all one- or two-byte errors but can miss some three-byte errors.

Effectiveness of using a prime modulus

I did a long writeup on essentially the same question: why modulo a prime number?

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

The short answer

We know much less about prime numbers than composite ones. Therefore people like Knuth started using them.

While it might be true that primes have less relationship to much of the data we hash, increasing the table/modulo size also decreases the probability of a collision (sometimes more than any benefit gained from rounding down to the nearest prime).

Here is a graph of collisions per bucket with 10 million cryptographically random integers comparing mod 65521 vs 65535.

Brandibrandice answered 9/6, 2009 at 2:52 Comment(7)
You're only considering the simple case of sums modulo a number. Obviously there will be fewer collisions when you take random numbers and mod them to a larger number, prime or not. For something useful, compare with multiplication instead, or, for this question, compare the Fletcher/Adler-32 checksum algorithm. – Reproachless
Oops, the above comment was made in haste; I see you did compare Fletcher v/s Adler-32. Good work :) It is not surprising that simply taking mod a smaller number gives fewer collisions; there are more buckets after all. I do object to your statement that "We know much less about prime numbers than composite ones. Therefore people like Knuth started using them", though. You make it sound like voodoo. Knuth doesn't do voodoo. :P I'm pretty certain what's going on here is people hurriedly reading TAOCP and misinterpreting it. – Reproachless
@ShreevatsaR: well many people would like to believe Knuth is infallible, but Paul Hsieh's quote seems to indicate that he rewrote it to remove the mod P: "I showed him examples where modulo by prime had extremely bad collisions, but this did not seem to sway him at all. It seems Knuth’s rewrites have come too late." – Brandibrandice
Yes, but it seems he's as vague about what parts of Knuth and what rewrites, as his boss presumably was. It's much more likely that the boss just pointed vaguely and said "look, primes!", and Paul Hsieh is just vaguely saying "look, 30-year-old book". :) (I upvoted this answer, BTW.) – Reproachless
@ShreevatsaR: thanks for the upvote. Also good find on the Maxino article that I somehow missed. I just needlessly duplicated a finding. – Brandibrandice
The meaning of the collisions graph isn't clear to me. Right now it looks like 'buckets per collision' which can't be right. – Spellbound
The answer was edited to remove the now-dead link to the graph, but the "Here is a graph" remains. Can the graph be restored? If not, the answer should be edited to remove the obsolete paragraph. – Birecree

The Adler-32 algorithm is to compute

A = 1 + b1 + b2 + b3 + ... + bn

and

B = (1 + b1) + (1 + b1 + b2) + (1 + b1 + b2 + b3) + ... + (1 + b1 + ... + bn)

and report them modulo m. Expanding B shows that it is a positionally weighted sum: each byte is multiplied by a distinct weight modulo m. To keep the arithmetic small, the example below uses the simplified sums A = b1 + b2 + ... + bn and B = b1 + 2*b2 + 3*b3 + ... + n*bn, which have the same structure, so the argument carries over to the real checksum. When m is prime, the numbers modulo m form what mathematicians call a field. Fields have the handy property that for any nonzero c, we have a = b if and only if c * a = c * b. Compare the times table modulo 6, which is not a prime, with the times table modulo 5, which is:

* 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 1 2 3 4 5
2 0 2 4 0 2 4
3 0 3 0 3 0 3
4 0 4 2 0 4 2
5 0 5 4 3 2 1

* 0 1 2 3 4
0 0 0 0 0 0
1 0 1 2 3 4
2 0 2 4 1 3
3 0 3 1 4 2
4 0 4 3 2 1

Now, the A part gets fooled whenever we interchange two bytes -- addition is commutative after all. The B part is supposed to detect this kind of error, but when m is not a prime, more locations are vulnerable. Consider an Adler checksum mod 6 of

1 3 2 0 0 4

We have A = 4 and B = 1. Now consider swapping b2 and b4:

1 0 2 3 0 4

A and B are unchanged because 2 * 3 + 4 * 0 = 2 * 0 + 4 * 3 (modulo 6). One can also swap b3 and b6 to the same effect. Such collisions are more likely when the times table is unbalanced -- modulo 5, both of these changes are detected. In fact, the only time a prime modulus fails to detect a single swap is when the two values or the two indexes are congruent mod m (and if m is big, congruent indexes must be far apart!).^ This logic can also be applied to interchanged substrings.

The disadvantage in using a smaller modulus is that it will fail slightly more often on random data; in the real world, however, corruption is rarely random.

^ Proof: suppose that we swap indexes i and j with values a and b. Then ai + bj = aj + bi, so ai - aj + bj - bi = 0 and (a - b)*(i - j) = 0. Since a field is an integral domain, it follows that a = b (values are congruent) or i = j (indexes are congruent).
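The swap example above can be checked directly. This sketch uses the simplified sums from the example (A as the plain byte sum, B weighting byte i by i):

```python
def simple_check(data, m):
    # A: plain sum of the bytes; B: positionally weighted sum (byte i gets weight i).
    a = sum(data) % m
    b = sum(i * x for i, x in enumerate(data, start=1)) % m
    return a, b

original = [1, 3, 2, 0, 0, 4]
swapped  = [1, 0, 2, 3, 0, 4]  # b2 and b4 interchanged

print(simple_check(original, 6), simple_check(swapped, 6))  # (4, 1) (4, 1): undetected mod 6
print(simple_check(original, 5), simple_check(swapped, 5))  # pairs differ: detected mod 5
```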

EDIT: the website that Unknown linked to (http://www.zlib.net/zlib_tech.html) makes it clear that the design of Adler-32 was not at all principled. Because of the Huffman code in a DEFLATE stream, even small errors are likely to change the framing (because it's data-dependent) and cause large errors in the output. Consider this answer a slightly contrived example for why people ascribe certain properties to primes.

Counterfactual answered 9/6, 2009 at 3:39 Comment(4)
Sorry, but your conclusion is incorrect. 1,3,2,0,5,4 -> 1,3,2,5,0,4 % 5 under equation B of adler-32 gives 3 for both. While using mod 6, it gives 1 and 0 respectively. The indexes here swapped are 4 and 5, both of which are not "two equal indexes mod m", therefore your conjecture is disproved. – Brandibrandice
Not disproved -- they're congruent mod 5. I changed the example. – Counterfactual
(m should be larger than the maximum input value.) – Counterfactual
BTW, thanks for the downvotes on my unrelated questions? imgur.com/xTa3t.png – Brandibrandice

Long story short:

Taking the modulus with a prime has the best bit-shuffling properties, and that's exactly what we want in a hash value.

Grigri answered 29/5, 2009 at 18:42 Comment(1)
What is it about the modulo of a prime that makes it a better bit shuffler? – Lyford

For perfectly random data, the more buckets the better.

Let's say the data is non-random in some way. Now, the only way that the non-randomness could affect the algorithm is by creating a situation where some buckets have a higher probability of being used than others.

If the modulus is non-prime, then any pattern aligned with one of its factors can affect the hash. So if you're using 15, a pattern recurring every 3 or 5 positions, as well as every 15, could cause collisions, while if you're using 13 the pattern would have to recur every 13 to cause collisions.

65535 = 3*5*17*257, so a pattern involving 3 or 5 could cause collisions using this modulo-- if multiples of 3 were much more common for some reason, for instance, then only the buckets which were multiples of 3 would be put to good use.
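This effect is easy to demonstrate with a sketch (the input distribution here is an illustrative assumption): hash only multiples of 3 and see which buckets each modulus can reach. Since 3 divides 65535, a multiple of 3 stays a multiple of 3 after reduction, so two thirds of the composite modulus's buckets can never be hit; the prime 65521 has no such alignment.

```python
import random

m_composite, m_prime = 65535, 65521  # 65535 = 3*5*17*257; 65521 is prime
data = [3 * random.randrange(1_000_000) for _ in range(100_000)]

buckets_composite = {x % m_composite for x in data}
buckets_prime = {x % m_prime for x in data}

# Every residue mod 65535 of a multiple of 3 is itself a multiple of 3.
print(all(r % 3 == 0 for r in buckets_composite))  # True
# The prime modulus spreads the same inputs across all residues.
print(any(r % 3 != 0 for r in buckets_prime))      # True
```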

Now I'm not sure whether, realistically, this is likely to be an issue. It would be good to determine the collision rate empirically with actual data of the type one wants to hash, not random numbers. (For instance, would numerical data following Benford's Law (http://en.wikipedia.org/wiki/Benford%27s_law) or some such irregularity cause patterns that would affect this algorithm? How about using ASCII codes for realistic text?)

Tace answered 9/6, 2009 at 19:13 Comment(0)

The answer lies in field theory. The set Z/Z_n with the operations plus and times (i.e. addition and multiplication modulo n) is a field when n is a prime.

In other words, the equation

m * x = 0 (in Z/Z_n)

has only the solution x = 0 for any nonzero value of m.

Consider this example:

2 * x = 0 (mod 10)

This equation has two solutions, x = 0 and x = 5. That is because 10 is not a prime and can be written as 2 * 5.

This property is responsible for better distribution of the hash values.
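The claim can be checked directly by counting the solutions of c * x ≡ 0 (mod n) for a composite and a prime modulus:

```python
def zero_solutions(n, c):
    # All x in Z/Z_n with c * x ≡ 0 (mod n).
    return [x for x in range(n) if (c * x) % n == 0]

print(zero_solutions(10, 2))  # [0, 5]: 10 = 2 * 5 is composite
print(zero_solutions(7, 2))   # [0]: 7 is prime, so 2 is invertible mod 7
```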

Ribaudo answered 29/5, 2009 at 17:52 Comment(0)

Checksums are generally used with the intention of detecting that two things are different, especially in cases where both things are not available at the same time and place. They might be available at different places (e.g. a packet of information as sent, versus a packet of information as received), or different times (e.g. a block of information when it was stored, versus a block of information when it was read back). In some cases, it may be desirable to check whether two things that are stored independently in two different places are likely to match, without having to send the actual data from one device to the other (e.g. comparing loaded code images or configurations).

If the only reasons that the things being compared wouldn't match would be random corruption of one of them, then the use of a prime modulus for an Adler-32 checksum is probably not particularly helpful. If, however, it's possible that one of the things might have had some 'deliberate' changes made to it, use of a non-prime modulus may cause certain changes to go unnoticed. For example, the effects of changing a byte from 00 to FF, and changing another byte that's some multiple of 257 bytes earlier or later from FF to 00, would cancel out when using a Fletcher's checksum, but not when using an Adler-32 checksum. It's not particularly likely that such a scenario would occur from random corruption, but such offsetting changes could occur when changing a program. It wouldn't be especially likely that they'd occur an exact multiple of 257 bytes apart, but it's a risk which can be avoided by using a prime modulus (provided, at least, that the number of bytes in the file is smaller than the modulus).
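The 257-byte scenario can be sketched with a byte-wise Fletcher-style sum (an assumption for illustration; real Fletcher-32 runs over 16-bit words, but the arithmetic is the same). The two offsetting edits shift the second sum by 255 * 257 = 65535, which vanishes modulo 65535 but not modulo the prime 65521:

```python
def running_sums(data, m):
    # Fletcher/Adler-style pair: A sums the bytes, B sums the running A values.
    a = b = 0
    for byte in data:
        a = (a + byte) % m
        b = (b + a) % m
    return (b << 16) | a

original = bytearray(1000)
original[100] = 0xFF                 # the byte that will be 'deliberately' edited

tampered = bytearray(original)
tampered[100] = 0x00                 # FF -> 00
tampered[100 + 257] = 0xFF           # 00 -> FF, exactly 257 bytes later

print(running_sums(original, 65535) == running_sums(tampered, 65535))  # True: undetected
print(running_sums(original, 65521) == running_sums(tampered, 65521))  # False: detected
```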

Marketa answered 29/5, 2009 at 17:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.