I develop backup software among other things, and I can tell you that a CRC is neat, but I've gotten many, many reports where a CRC did not catch data corruption (but SHA-2 did). Far fewer reports indicated that the CRC caught it first.
My conclusion: A CRC32 over a couple megs of data is only good enough to likely catch "gross" corruption, like truncating the data, not the finer corruption that memory and disks tend to produce, especially when we're talking about bit rot.
Also, unless you're using an optimized implementation using eg. PCLMUL or at least a slice-by-4/8 variant -- and zlib's is not optimized -- it's a waste of time, because eg. BLAKE2b isn't much slower than zlib's CRC, but will catch All The Corruption!1 [1] (Also, Python 3.6 added the full might of BLAKE2 to the stdlib, although it's the reference implementation and not some fancy SSE4/AVX(2) impl.)
[1] A technical note: There is a tradeoff between checksum length and data length; with a good checksum, increasing its length eventually increases the error rate over the same medium (aka false positives [the checksum is corrupted instead of the data]). However, it's usually more important these days to be certain about correctness and not to minimize total indicated BER over a certain channel.
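For reference, hashlib.blake2b and zlib.crc32 are both in the Python stdlib, so checking both over a large file is a one-pass loop. This is just a rough sketch (the function name and chunk size are mine, not from any particular backup tool):

    import hashlib
    import zlib

    def checksums(path, chunk_size=1 << 20):
        # One pass over the file, feeding both a CRC32 and a BLAKE2b digest.
        crc = 0
        b2 = hashlib.blake2b()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                crc = zlib.crc32(chunk, crc)
                b2.update(chunk)
        return crc, b2.hexdigest()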
You make an excellent point: probability says a 4-byte CRC32C provides much weaker guarantees as the length of the message gets longer. These CRCs are typically optimized for pretty short messages, and that is what I had in mind when I wrote that article (e.g. the kind of messages that might get exchanged as part of an online serving system).
I think it's an apple to orange comparison; sha1 needs to be compared to a proper crc160, and sha2 to a proper crc256.
(Proper means one nontrivial cycle, e.g. an irreducible generating polynomial of the appropriate size.)
I would be surprised if there's a material difference in error detection between sha2 and crc255 (other than the 1 bit difference).
However, crc is useless against an adversary - preimage attacks are trivial. At the same size, it becomes an apple to apple comparison, and which one you choose depends on your speed/preimage-resistance requirements.
Or go the other way: compare the crc32 to a 4-byte blake2 hash (blake2 can vary its output length in multiples of 8 bits).
I'd expect that, no matter how many bits were corrupted, the pattern of the corruption, or the size of the file, a 4-byte blake2 hash would always have a 1 in 4294967296 (2^32) chance of not catching the error; for crc32, the chance of not catching the error would increase with the size of the file or the number of corrupted bits. On the other hand, a CRC is guaranteed to catch certain corruption patterns (like all burst errors of up to a certain size), while for a cryptographic hash like blake2, there's no such guarantee.
(A cryptographic hash sent together with the data, by itself, is also useless against an adversary, since the adversary can change the hash. You need at least a keyed hash to protect against that.)
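For what it's worth, both points map directly onto Python's hashlib API -- digest_size for the 4-byte variant, key= for a keyed hash. A small sketch (the variable names are just for illustration):

    import hashlib

    data = b"payload to protect"

    # 4-byte BLAKE2b digest, the same size as a CRC32:
    short_tag = hashlib.blake2b(data, digest_size=4).digest()

    # Keyed BLAKE2b (a MAC), which an adversary can't forge without the key:
    mac = hashlib.blake2b(data, key=b"some shared secret").hexdigest()

    print(short_tag.hex(), mac)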
A crc32 will miss an error if and only if crc32(error-pattern) = 0, which happens with probability 1 in 2^32 for a random error pattern, assuming a full-cycle polynomial. While these patterns are easy to characterize, and are independent of the data (neither is true for a crypto hash), the probabilities are the same.
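A quick way to see the data-independence claim with Python's zlib (a sketch; note that zlib.crc32 applies init/xorout constants, so the "crc of the error pattern" shows up as crc32(err) ^ crc32(zeros) rather than crc32(err) itself):

    import zlib

    msg = b"x" * 1000                       # arbitrary data
    err = bytearray(1000)
    err[123] ^= 0x10                        # a fixed error pattern
    corrupted = bytes(m ^ e for m, e in zip(msg, err))
    zeros = bytes(1000)

    # The CRC "difference" caused by the error depends only on the error
    # pattern, not on the data it lands on:
    syndrome = zlib.crc32(corrupted) ^ zlib.crc32(msg)
    assert syndrome == zlib.crc32(bytes(err)) ^ zlib.crc32(zeros)
    # The corruption goes undetected exactly when that syndrome is 0.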
Is it really an apple to orange comparison? CRC can't do everything SHA does, but SHA can do what CRC does, so it's more like an apple to apple-and-orange comparison. And that's an easy comparison: pick the second one! You get a free orange!
Is there something that a proper CRC256 would be better at than SHA-256?
Crc is much simpler; depending on the polynomial, it can be as little as 2 xor gates and a shift register; if you do your own asic/fpga, I suspect it could be 10-100 times smaller and faster than sha256. On a modern CPU, you can do something like 100 bits/clock of CRC, and sha is 10-100 times slower.
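For the curious, the shift-register construction translates almost directly into software; a sketch of the bit-serial (reflected) CRC32, one shift and one conditional xor per input bit:

    import zlib

    def crc32_bitwise(data: bytes, poly: int = 0xEDB88320) -> int:
        # One conditional xor and one shift per input bit: the software
        # analogue of the shift-register construction described above.
        crc = 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (poly if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32_bitwise(b"hello world") == zlib.crc32(b"hello world")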
The other thing that crc can do that sha256 cannot is act as an error correcting code of sorts; if your errors are limited to a small subset (say, only bursts), you can tabulate the crc of those errors, and then (expected-crc xor computed-crc) is the crc of the error; if it's in your list, you know what the error is. I actually worked on a device where this was useful some 20 years ago.
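A sketch of that tabulation idea (here only for single-bit errors over a fixed-length message; the helper names are made up, and the "raw" CRC cancels out zlib's init/xorout constants so the syndrome lookup works):

    import zlib

    MSG_LEN = 64  # bytes; the table is specific to the message length

    def raw_crc32(pattern: bytes) -> int:
        # zlib.crc32 applies init/xorout constants; xoring with the CRC of an
        # all-zero message of the same length cancels them out, leaving the
        # "raw" CRC that (expected ^ computed) actually equals.
        return zlib.crc32(pattern) ^ zlib.crc32(bytes(len(pattern)))

    # Tabulate the raw CRC of every single-bit error pattern.
    error_table = {}
    for bit in range(MSG_LEN * 8):
        pattern = bytearray(MSG_LEN)
        pattern[bit // 8] ^= 1 << (bit % 8)
        error_table[raw_crc32(bytes(pattern))] = bytes(pattern)

    def correct_single_bit(received: bytes, expected_crc: int):
        syndrome = zlib.crc32(received) ^ expected_crc
        if syndrome == 0:
            return received                  # no (detectable) error
        pattern = error_table.get(syndrome)
        if pattern is None:
            return None                      # error is not in the tabulated set
        return bytes(r ^ p for r, p in zip(received, pattern))

    # Flip one bit of a 64-byte message and recover the original.
    original = bytes(range(MSG_LEN))
    corrupted = bytearray(original)
    corrupted[10] ^= 0x04
    assert correct_single_bit(bytes(corrupted), zlib.crc32(original)) == original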
That second bit is pretty cool, I didn't know about that. I actually experimented a bit with using cryptographic hashes as an error correcting code, but you have to brute force it, so it's only practical for small amounts of data and small errors.
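Presumably something along these lines -- a toy sketch that brute-forces a single flipped bit against a BLAKE2b digest; the cost grows linearly with the message size and combinatorially with the number of flipped bits:

    import hashlib

    def fix_single_bit_flip(received: bytes, expected_digest: bytes):
        # Try every single-bit flip until the hash matches again.
        if hashlib.blake2b(received).digest() == expected_digest:
            return received
        buf = bytearray(received)
        for bit in range(len(buf) * 8):
            buf[bit // 8] ^= 1 << (bit % 8)      # flip a candidate bit
            if hashlib.blake2b(bytes(buf)).digest() == expected_digest:
                return bytes(buf)
            buf[bit // 8] ^= 1 << (bit % 8)      # undo and move on
        return None                              # more than one bit was wrong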
CRC64 will probably[1] suffice for any data package you'll see in your life. A cryptographic hash of 64 bits won't.
That said, it's probably easier to find a reliable implementation of SHA256 than of some CRC64, so, if the added CPU cost doesn't bother you[2], I'd say SHA256 is the best choice.
1 - CRC40 will likely be enough. Those extra 24 bits are a huge safety margin.
2 - And it normally shouldn't, not even for performance sensitive applications. The hash is very fast, and you are doing it before IO. Except for some fast disappearing niches, it just won't matter.
What would a CRC64 do better than, say, using the first 64 bits of the SHA-256 (other than performance, of course)? I see talk of CRCs always catching contiguous errors smaller than the CRC size, is it just that a CRC is guaranteed to do that and a 64-bit cryptographic hash isn't?
Yes. CRC is linear in the data and the error. This makes it easy to analyze the crc of specific error sets, and if none of them has a zero crc, then any error in that set is guaranteed to be found.
Of course it's a bit of an apples to oranges comparison.
An interesting question about CRCs is whether you can implement them efficiently. CRC32 is possible with some help from the CPU (CLMUL as mentioned, or dedicated instructions for specific polynomials, like crc32c); eg. on Haswell you get to 14-16 GB/s with a CLMUL approach on CRC32. That's obviously much faster than the hashes I mentioned. With the zlib implementation, on the other hand, you only get about one GB/s, and that's already really close to BLAKE2b.
Table-based approaches (classic zlib implementation, or slice-by-n, which needs n-times as many tables) would not work well with longer polynomials; the tables would outgrow L1/L2 cache sizes rather quickly. I'd have to think a bit about the tableless CLMUL approach.
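If you want to see where your own build lands, a rough timing sketch with the stdlib (results depend heavily on the CPU and on how Python/OpenSSL were built, so treat the numbers as indicative only; the helper is mine):

    import hashlib
    import time
    import zlib

    buf = bytes(64 * 1024 * 1024)   # 64 MiB of zeros is enough for a rough rate

    def rate(label, fn):
        t0 = time.perf_counter()
        fn(buf)
        gbps = len(buf) / (time.perf_counter() - t0) / 1e9
        print(f"{label:8s} {gbps:6.2f} GB/s")

    rate("crc32",   zlib.crc32)
    rate("sha256",  lambda b: hashlib.sha256(b).digest())
    rate("blake2b", lambda b: hashlib.blake2b(b).digest())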
The zlib format does not use a crc at all (zip does, but zlib doesn't). It uses adler-32, which shares most properties with crc32 (prob of missing a random error is almost the same, error patterns are easy to analyze and are 16-bit additive and independent of data).
However, because SHA-256 is so widely used, efficient implementations are available off the shelf; I've even seen talk about providing it in hardware on the CPU. Given that, you might as well just go ahead and use it.
Various CPUs have hardware SHA-1/2 support, eg. everything with ARM NEON -- which means many smartphones and ARM chips, even some ARM boards. Most Sun/Oracle SPARC machines have crypto engines that do this (and many other algorithms); the same goes for POWER.
I'm not quite sure why no mainstream x86 extensions have extra stuff for this (VIA PadLock does SHA), but I can take a guess: ARX is already reasonably efficient in software (more so with newer designs, eg. BLAKE2 + AVX2) - unlike AES. And unlike AES there aren't really many problems with side channels, especially when we think about GCM.
>"A CRC32 over a couple megs of data is only good enough to likely catch "gross"
The CRC on a TCP segment (MSS) or on an Ethernet frame would never cover more than 1460 or 1500 bytes (or 9K in the case of jumbo frames), so how or where would a CRC over multiple megs come into play in regular networking?
CRC is used on jumbo Ethernet frames of 9K and it works just fine. In fact, this is the reason that 9K is the limit on jumbo frames.
The author of the article was arguing that you should add a CRC to the data being sent across TCP/IP. So that would encompass the whole payload, not just per-packet.