Checksumming: CRC or hash?
Asked Answered
F

2

10

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or hash truncated to N bytes? I.e. which will have a smaller probability to miss an error? Specifically:

  1. CRC32 vs. 4-byte hash
  2. CRC32 vs. 8-byte hash
  3. CRC64 vs. 8-byte hash

Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.

As far as I understand, CRC32 can detect up to 32 bit flips with 100% reliability, but after that its reliability approaches 1-2^(-32) and for some patterns is much worse. A perfect 4-byte hash reliability is always 1-2^(-32), so go figure.

8-byte hash should have a much better overall reliability (2^(-64) chance to miss an error), so should it be preferred over CRC32? What about CRC64?

I guess the answer depends on type of errors that might be expected in such sort of operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, should not accidental bit flips be taken care of already?

Feather answered 26/1, 2013 at 10:39 Comment(2)
I think I'm confused what "general hash" means.Polyphony
Ok, removed "general", my bad.Feather
Z
16

Only you can say whether 1-2-32 is good enough or not for your application. The error detection performance between a CRC-n and n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.

Update:

The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.

In order to get the CRC-32C calculation that fast, I had to get fancy with three crc32 instructions on three separate buffers in order to make use of the three arithmetic logic units in parallel in a single core, and then writing the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as a CityHash64 or CityHash128.

Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. feeding the functions a chunk at a time and expecting to get the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to modified to update an intermediate state.

The alternative, and what I did for the quick and dirty testing, is to use the Seed versions of the functions where I would use the CityHash from the previous buffer as the seed for the next buffer. The problem with that is that the result is then dependent on the buffer size. If you feed CityHash different size buffers with this approach, you get different hash values.

Another Update four years later:

Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.

Zee answered 26/1, 2013 at 17:6 Comment(3)
Well, there are some hash functions, like CityHash or MurMurHash which can do several bytes per clock cycle on 1K messages, so they are likely to beat unaccelerated CRC32 computation. And they produce 128-bit output to boot. So I was wondering is there anything conceptual about CRC that makes it a better checksum than a good hash. But I guess you're right, it is all about the number of bits, so I guess I'll choose hash.Feather
No, there's nothing about a CRC that makes it a better checksum, unless perhaps your noise source is a small number of bit flips. I don't know if any hash functions are guaranteed to detect all possible 1 to n bit flips as a CRC-n is guaranteed to.Zee
You're correct about CityHash. I was surprised to see how fast it is.Zee
I
0

Putting aside "performance" issues; you might want to consider using one of the SHA-2 functions (say SHA-256).

Isola answered 28/1, 2013 at 5:35 Comment(3)
Wow. That's really putting aside performance issues. SHA-256 takes 100 times as long as a CRC-32 or 50 times as long as a CityHash. And for no reason, since this is not an application that requires a cryptographic hash.Zee
Well, actually I might. May be not exactly SHA-256 since I don't need cryptographic strength, but, given that the number of bits in the checksum is paramount here, looking into 256-bit hashes might make sense. I'm just not sure there are any besides SHA-256 and if they are any good. Also this is not to hash short strings for a hash table, it is to checksum messages that should normally exceed 1KB. I guess it is a matter of benchmarking to see how much of overhead it may introduce. I'll definitely keep it as an option.Feather
Just did a quick search, and there you are: CityHash 256-bit version! Must be an order of magnitude faster than SHA.Feather

© 2022 - 2024 — McMap. All rights reserved.