Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or a hash truncated to N bytes? That is, which has the smaller probability of missing an error? Specifically:
- CRC32 vs. 4-byte hash
- CRC32 vs. 8-byte hash
- CRC64 vs. 8-byte hash
The data blocks are to be transferred over a network and stored on disk, repeatedly. Blocks can be anywhere from 1KB to 1GB in size.
As far as I understand, CRC32 detects any single error burst up to 32 bits long with 100% reliability, but beyond that its reliability approaches 1-2^(-32) and for some patterns is much worse. A perfect 4-byte hash always has a reliability of 1-2^(-32), so go figure.
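To make the comparison concrete, here is a minimal sketch of the experiment I have in mind. It assumes Python's `zlib.crc32` as the CRC32 and SHA-256 truncated to 4 bytes as a stand-in for the hypothetical "perfect" hash; the helper names (`truncated_hash`, `corrupt_window`, `count_missed`) are mine and purely illustrative. It only exercises damage confined to 4 consecutive bytes, which is the case where CRC32 should never miss.

```python
import hashlib
import random
import zlib


def truncated_hash(block: bytes, n_bytes: int = 4) -> bytes:
    # SHA-256 is only a stand-in for the hypothetical "perfect" hash.
    return hashlib.sha256(block).digest()[:n_bytes]


def corrupt_window(block: bytes, offset: int, pattern: int) -> bytes:
    """XOR a 32-bit pattern into the 4 consecutive bytes starting at offset."""
    damaged = bytearray(block)
    for i, b in enumerate(pattern.to_bytes(4, "big")):
        damaged[offset + i] ^= b
    return bytes(damaged)


def count_missed(trials: int = 2000, block_size: int = 1024, seed: int = 0):
    """Return (crc_missed, hash_missed) over random 32-bit-window errors."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_size))
    good_crc = zlib.crc32(block)
    good_hash = truncated_hash(block)
    crc_missed = hash_missed = 0
    for _ in range(trials):
        offset = rng.randrange(block_size - 3)
        pattern = rng.randrange(1, 2**32)  # nonzero, so the block really changes
        damaged = corrupt_window(block, offset, pattern)
        if zlib.crc32(damaged) == good_crc:
            crc_missed += 1
        if truncated_hash(damaged) == good_hash:
            hash_missed += 1
    return crc_missed, hash_missed


if __name__ == "__main__":
    print(count_missed())
```

The CRC count should always be zero for this error class, while the truncated hash has a roughly 2^(-32) chance of a miss per trial, far too small to observe at this scale. The sketch says nothing about larger or scattered corruption, which is the part I am actually unsure about.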
An 8-byte hash should have much better overall reliability (a 2^(-64) chance of missing an error), so should it be preferred over CRC32? What about CRC64?
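For a sense of scale, here is a rough back-of-the-envelope sketch. It assumes each corrupted block independently slips past an n-bit checksum with probability 2^(-n), which is the idealized hash behavior and not necessarily true of a CRC. The function names and the figure of 10^9 corrupted blocks are hypothetical, chosen only for illustration.

```python
import math


def miss_probability(checksum_bits: int) -> float:
    """Idealized chance that one corrupted block passes an n-bit check."""
    return 2.0 ** -checksum_bits


def prob_at_least_one_miss(corrupted_blocks: int, checksum_bits: int) -> float:
    """1 - (1 - p)**k, computed in a way that stays accurate for tiny p."""
    p = miss_probability(checksum_bits)
    return -math.expm1(corrupted_blocks * math.log1p(-p))


if __name__ == "__main__":
    k = 10**9  # hypothetical number of corrupted blocks over the system's life
    for bits in (32, 64):
        print(bits, prob_at_least_one_miss(k, bits))
```

Under these assumptions, a 32-bit check has a substantial chance of letting at least one of a billion corrupted blocks through, while a 64-bit check makes that chance negligible.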
I guess the answer depends on the type of errors to be expected in this sort of operation. Are we likely to see sparse 1-bit flips or massive block corruption? Also, given that most storage and networking hardware implements some sort of CRC, shouldn't accidental bit flips be taken care of already?