How reliable is the adler32 checksum?

I wonder how reliable the Adler-32 checksum is compared to, e.g., MD5 checksums. Wikipedia says that Adler-32 is "much less reliable" than MD5, so I wonder how much less, and in what way?

More specifically, I'm wondering if it is reliable enough as a consistency check for long-time archiving of (tar) files of size 20GB+?

Gonzales answered 18/5, 2011 at 11:9 Comment(0)

For details on the error-checking capabilities of the Adler-32 checksum, see for example "Revisiting Fletcher and Adler Checksums" (Maxino, 2006).

This paper contains an analysis of the Hamming distance provided by these two checksums, and gives the residual error rate for data words up to about 2^11 bits, which is obviously far short of your requirement of roughly 2^38 bits.

Spondylitis answered 18/5, 2011 at 11:43 Comment(1)
Thanks, that's a useful paper. It would be interesting to see calculations for word lengths up to 20GB+, and whether a similar threshold effect, with a sharply increasing rate of undetected errors, occurs somewhere before that for the Adler-32 algorithm too.Gonzales

Adler32 serves an entirely different purpose than MD5. Adler32 is a checksum; MD5 is a secure message digest. Adler32 is meant for quick hashing: it has a small bit space and a simple algorithm. Its collision rate is low, but not low enough to be secure. MD5, SHA, and other cryptographic/secure hashes (or message digests) have much larger bit spaces and more complex algorithms, and thus far fewer collisions. Compare SHA-256, for example: 256 bits against Adler32's measly 32 bits.

Adler does have its purpose, in hash tables for instance, or rapid data integrity checks. Still, it is not designed with the same purpose as MD5 or other secure digests.
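The size difference is easy to see with Python's standard library (a minimal sketch; `zlib.adler32` and `hashlib` are the stock implementations, and the sample string is arbitrary):

```python
import hashlib
import zlib

data = b"The quick brown fox jumps over the lazy dog"

adler = zlib.adler32(data)                  # a single 32-bit integer
md5 = hashlib.md5(data).hexdigest()         # 128-bit digest
sha256 = hashlib.sha256(data).hexdigest()   # 256-bit digest

print(f"Adler-32: {adler:#010x} (32 bits)")
print(f"MD5:      {md5} (128 bits)")
print(f"SHA-256:  {sha256} (256 bits)")
```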

BTW, if a simple but somewhat reliable checksum is what you need, then Fletcher seems to out-perform Adler. I'd speculate they both out-perform CRC, though perhaps not a simple addition-based checksum (which, however, is very prone to collisions). If you want BOTH performance AND security, then use BOTH algorithms: run the checksum as a quick calculation and lookup, then use the larger digest for a more thorough confirmation when a match is found.

To answer your question on ensuring the validity of archives: it would probably suffice just fine. Best choice? Questionable. Possibility of error? Very low.

Pilloff answered 23/9, 2012 at 10:43 Comment(1)
Now that I understand CRC checks, I'd expect a good CRC code to outperform Adler32. CRCs have a huge body of Galois-field theory backing them up: all good CRCs are immune to odd-bit errors (1-bit, 3-bit, 5-bit...), immune to 2-bit errors, and immune to almost all burst errors shorter than their size (ex: CRC32 is immune against a 32-long burst error). There's exactly one 32-bit burst error that a CRC32 cannot defend against, and that's its own polynomial (in the case of CRC-32C, a burst error of exactly 0x1EDC6F41). All other 32-bit burst errors are detected. No such theory exists for Adler.Cumulonimbus

This is an ancient algorithm; one which, as the Wikipedia page says, "trades accuracy for speed". In short, no, you shouldn't rely on it.

The point is that with multiple corruptions, this checksum might still pass as "okay". Due to the avalanche effect, this is significantly less likely to occur in modern algorithms (even the old MD5).
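You can observe this empirically: flipping a single input bit flips only a handful of bits in an Adler-32 checksum, but roughly half the bits of an MD5 digest (a small Python sketch; the sample data is arbitrary):

```python
import hashlib
import zlib

def bit_diff(a: int, b: int) -> int:
    """Count the output bits that differ between two hash values."""
    return bin(a ^ b).count("1")

data = bytearray(b"archive block " * 4)
flipped = bytearray(data)
flipped[0] ^= 0x01  # flip a single input bit

adler_delta = bit_diff(zlib.adler32(data), zlib.adler32(flipped))
md5_delta = bit_diff(
    int.from_bytes(hashlib.md5(data).digest(), "big"),
    int.from_bytes(hashlib.md5(flipped).digest(), "big"),
)
print(adler_delta)  # only a few of 32 bits change
print(md5_delta)    # roughly half of 128 bits change (avalanche)
```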

For today's machines, speed is not much of a concern, so I'd suggest using a modern algorithm (whichever is current), even for files in the TB range. The insignificant time savings you'd get with an old checksum system are IMHO not enough to balance the significantly increased risk of undetected data corruption. And honestly, 20GB of files is not so much data these days that you'd need to resort to weak (and I daresay broken) algorithms.

Interfertile answered 18/5, 2011 at 11:19 Comment(3)
I don't think Adler-32 can be "broken", because it doesn't appear to have ever been meant for security purposes. Everywhere I've seen it mentioned in the past ten years has referred to it as what you use to check that a file hasn't been accidentally derped. Additionally, "today's machines" might not be what's in use. There are microcontrollers and real-time applications that might have something more important to do with their time than calculate 512-bit hashes.Phenanthrene
@Skrylar: Fair enough; note that I never mentioned security purposes. "Broken" also has the meaning "malfunctioning", i.e. "not performing its function correctly, e.g. reporting errors where there are none, or reporting no errors despite their presence"; note that my answer says so. That said, did you actually read the answer (never mind the question, namely its last sentence), or did you just see "modern machines" and jump in with "ooh, ooh, I know, I know: what if it runs on an RT toaster"?Interfertile
As for checksumming an archive for storage, I would call that insufficient: "Checking...yup, it's broken." Now what? The archive also needs to be resilient, see e.g. en.wikipedia.org/wiki/ParchiveInterfertile

It is less reliable than, say, MD5 or CRC (about the same as CRC, actually). Its advantage is speed; its disadvantage shows most for short data (a few hundred bytes), where the distribution of hash values does not cover the available 32-bit output well. For big files it is a good choice.
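The poor coverage for short inputs is easy to demonstrate: across all 256 possible 1-byte messages, Adler-32 produces values crowded into a tiny slice of the 32-bit range (a quick Python check):

```python
import zlib

# Every possible 1-byte message yields a distinct checksum, but all of
# them sit at the very bottom of the 32-bit output space.
values = {zlib.adler32(bytes([b])) for b in range(256)}

print(len(values))          # 256 distinct values
print(max(values) / 2**32)  # fraction of the 32-bit range actually used
```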

Dall answered 18/5, 2011 at 11:18 Comment(3)
-1 for the last sentence: 20GB is not very big by today's standards, and using a weak redundancy check will come back and bite you (maybe not next week, maybe not next month, maybe not even next year). Finding 10 years worth of archives in unreadable state (yet with CRC claiming to be correct) kind of sucks.Interfertile
The forum thread was great ... especially as someone had done an actual test of the reliability ... would be great to find something similar for big files (though I realize the time it would take :s)Gonzales
What is a "corruption"? If it's one byte, then that's not possible. At least four bits must be corrupted before you can get back to the same CRC-32.Auricula

Adler-32 and MD5 are not comparable in this way. MD5 is intended to be a cryptographic checksum for when you want to make sure a file hasn't been tampered with by an adversary, while Adler-32 (and also CRC, which is comparable to Adler-32) is intended to make sure a file hasn't been altered by accident (an integrity checksum).

MD5 is actually considered broken for its cryptographic purposes, and is only useful now as an integrity check when you want more bits for certainty. The only way Adler-32 can be "less reliable" is that it allows potentially more bits to be altered while retaining the same output, which means there is more room for collisions.
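For illustration, Adler-32 collisions are trivial to construct for short inputs, since only the byte sum and a position-weighted sum feed into the result; here is one 3-byte pair (checked with Python's `zlib.adler32`), which MD5 of course distinguishes:

```python
import hashlib
import zlib

# Two different 3-byte inputs with the same byte sum and the same
# position-weighted sum, hence identical Adler-32 checksums.
a = bytes([1, 4, 1])
b = bytes([2, 2, 2])

print(zlib.adler32(a) == zlib.adler32(b))                  # True: Adler-32 collides
print(hashlib.md5(a).digest() == hashlib.md5(b).digest())  # False: MD5 does not
```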

This link gives a good discussion of how Adler-32 can provide performance benefits for code that needs cryptographic sums for added certainty: you can use the small, cheap checksum to decide whether running the more expensive MD5/SHA/Whirlpool is worth it in the event of changed files.

Phenanthrene answered 23/10, 2013 at 16:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.