openstack team mailing list archive

Re: Enabling data deduplication on Swift


Joe Gordon asked:

> Can SHA-1 collisions be generated?  If so can you point me to the article?

Check Wikipedia on cryptographic hashing, and especially the "preimage attack" article.

To summarize, SHA-256 is effectively immune from a preimage attack. Even MD5, with its 128-bit digest, is effectively
immune from coincidental collisions until you have something on the order of 2**64 fingerprinted items (the birthday
bound for 128 bits). Amazon S3 has a mere 2**40.

> Also why compare hashes in the first place?  Linux 'Kernel Samepage Merging', which does page deduplication for KVM, does a full
> compare to be safe [1].  Even if collisions can't be generated, what are the odds of a collision (for SHA-1 and SHA-256) happening by
> chance when using Swift at scale?

The point of distributed deduplication is to avoid transferring the data in the first place. If you did a safety check using a full
compare, you would have to do the transfer first. A kernel looking for identical pages does not have that issue.
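
To make the "avoid the transfer" point concrete, here is a minimal sketch in Python. The DedupStore class and the
upload() helper are hypothetical stand-ins invented for illustration; they are not Swift APIs. The sender fingerprints
the data locally and ships the payload only when the store does not already hold that fingerprint.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory stand-in for a remote object store."""

    def __init__(self):
        self._objects = {}       # hex digest -> object bytes
        self.bytes_received = 0  # payload bytes actually "transferred"

    def has(self, digest):
        # Cheap existence check: only the digest crosses the wire.
        return digest in self._objects

    def put(self, digest, data):
        # Expensive path: the full payload is transferred and stored.
        self.bytes_received += len(data)
        self._objects[digest] = data


def upload(store, data):
    """Send data only if the store lacks its fingerprint; return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    if not store.has(digest):
        store.put(digest, data)
    return digest


store = DedupStore()
first = upload(store, b"x" * 1024)
second = upload(store, b"x" * 1024)  # duplicate: no payload is sent
```

After both calls the store has received the 1024-byte payload only once. A full compare on the second upload would
have required sending those bytes again, which is exactly the cost deduplication is meant to avoid.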

As for the chance of an accidental collision, you have to stop and think about how large a number 2**128 is (that's the
threshold for risking a birthday collision with a 256-bit hash). Having multiple undetected network transmission errors
back to back is more likely.
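
For anyone who wants to plug in their own numbers, here is a rough sketch of the standard birthday approximation,
p ~= 1 - exp(-n**2 / 2**(bits+1)). The function name is my own, and the formula assumes the hash output is uniformly
distributed, so treat the result as an estimate rather than a guarantee.

```python
import math


def collision_probability(n_items, hash_bits):
    """Approximate chance of at least one accidental collision.

    Uses the birthday approximation p ~= 1 - exp(-n^2 / 2^(bits+1)),
    assuming uniformly distributed hash outputs.
    """
    # Python integers are arbitrary precision, so the division stays accurate
    # even when 2**(hash_bits + 1) is far beyond float range.
    x = n_items * n_items / 2 ** (hash_bits + 1)
    # expm1 keeps precision when x is tiny, where 1 - exp(-x) would round to 0.
    return -math.expm1(-x)


p_sha256 = collision_probability(2 ** 40, 256)
p_at_threshold = collision_probability(2 ** 128, 256)
```

With 2**40 items and a 256-bit hash the estimate comes out on the order of 1e-54. At the 2**128 threshold it rises to
roughly 0.39, which is why that figure is quoted as the point where a birthday collision becomes a real risk.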

Also I would recommend that SHA-256 be the *minimum* algorithm. Truly paranoid customers could select SHA-512.

ZFS has had SHA-256 for the purpose of detecting bit rot as the default for some time. It was a bit ambitious when ZFS
was designed, but with today's processors the computational overhead is negligible.