← Back to team overview

novacut-community team mailing list archive

Thoughts on SHA-3 (Keccak)

 

Fair warning: this is a very technical email about the applied cryptography
used in Dmedia (and indirectly, Novacut), which I know some on this list might
find extraordinarily boring.  My feelings wont be hurt if you stop reading now
:D

Background
==========

Files in Dmedia are given a globally unique ID based on their content-hash.
This is a fundamental part of Dmedia's design, and without it we couldn't build
such a loosely-coupled distributed filesystem.  In particular, deriving the ID
from the content-hash allows us to deal with removable storage in a very
elegant way.

Dmedia uses a custom hashing protocol[1] designed to solve a lot of practical
problems faced when transferring large files across the Internet, and to also
have some other nice engineering properties.

Our current protocol uses the Skein[2] hash, which was one five SHA-3
finalists, but not the winner.  The winner, announced on October 2nd, was
Keccak[3].  So I've spent a fair amount of time the last month learning about
Keccak and trying to decided whether we should stick with Skein, or switch to
the SHA-3 winner.

In short, I personally think we should stick with Skein, for reasons I'll
detail below.  However, I'll note that we don't expect the first, or any other
version of the protocol to be used "forever".  We have a simple migration
strategy and a plan that allows multiple version of the protocol to exist side
by side.  (The migration strategy is to use a different digest size, so we can
tell the version of the protocol by the length of the ID.)

That said, adding a new protocol version is very disruptive and expensive.  We
want each version of the protocol to have as long a useful life as possible,
and I welcome challenges to my rationale and conclusion.

I feel we should stick with Skein for two reasons:

    1. Performance (in software)

    2. Cryptographic Tying (the Skein parameter system)


Performance
===========

Here are the numbers for hashing 512 MiB of data on my 3.2 GHz Phenom II X4 955
(64-bit software):

    Hash       Time  MB/s
    =========  ====  =====
    skein256   1.22  439.8
    skein512   1.11  485.6
    skein1024  1.45  369.8
    sha3_224   1.99  269.9
    sha3_256   2.10  255.7
    sha3_384   2.73  196.5
    sha3_512   3.95  135.9

This was done using PySkein[4] and PySha3[5], both of which are pure C
implementations that don't make use of SIMD instructions.  The benchmark is the
hashbench.py[6] script found in the Dmedia source tree.

Our current protocol uses Skein-512 with a 240-bit digest size.  To get the
same digest size, we'd need to use sha3-256 (and then truncate).  So that's
255 MB/s for Keccak, and 485 MB/s for Skein, making Skein almost twice as fast.
(For reference, sha1 gets 478 MB/s on my system, just a touch slower than
Skein.)

Now faster is always nice of course, but my reason to favor Skein has to do
with where these performance numbers sit in relation to current storage IO
performance.  There are HDDs on the market capable of 150 MB/s, and SSDs
capable of over 500 MB/s.  There are SD and CF cards on the market capable of
over 100 MB/s (and we want to be able to import several in parallel).

Full content-hash verification is a key part of the active monitoring that
Dmedia does.  Importantly, we always do full verification when creating
additional copies of a file, because we want to be certain that we're at least
writing a known-good copy out to a storage device.  And a slower hash means
more power, more battery drain every time we do this.

Keccak can be very fast in hardware, faster than Skein.  However, that isn't a
here-and-now solution we can use.  Looking at the history of SHA-1 and SHA-2,
I'm a bit skeptical as to whether we'll ever see readily available hardware
implementations of SHA-3.  And even if that happens, it wont be any time soon
(consider how long it took for AES instructions to show up).

In the here and now, we need to make our decision based on software
performance.  And for how we use hashing in Dmedia, even Skein isn't as fast as
I wish it was.  And again, the version 1 hashing protocol isn't a forever and
ever solution.  It just needs to be practical now and have a reasonably long
useful lifetime.


Cryptographic Tying
===================

The Dmedia hashing protocol cryptographically ties the root-hash to the
file-size, and the leaf-hash to the leaf-index (the leaf-wise position in the
file, starting from zero).

Our cryptographic tying problem is similar to the MAC (Message Authentication
Code) problem.  And like the MAC problem, it's non-trivial to securely build
from a hash function.  I'll admit right now that during the dozen or so
iterations of the Dmedia hashing protocol, I've made pretty much every common
amateur mistake, and then some.

Thanks to critique from a number of people, the current protocol is decent,
but it still has problems.  In retrospect, using standard HMAC with Skein would
have been better than my construction.

But even better would be to use the Skein parameter system, which was designed
to be usable for MAC without the overhead of HMAC.  And that's exactly what
I've done in latest proposed protocol revision[1].

The Skein parameter system provides a very generic cryptographic tying
mechanism, which I came to appreciate much more while working on the secure
device peering protocol that landed in Dmedia 12.10[7].  I finally got my head
around HMAC and why it's designed the way it is, and realized we should just
use the Skein *key* parameter to do our cryptographic tying (the parameters are
all equivalent cryptographically, *key* just seems the clearest for this use
as far as the parameter name).

Tying in the leaf-index and file-size have proven a very practical features in
the protocol, something I think we are well-served by keeping.  But the problem
is Keccak doesn't take the same batteries included approach that Skein does,
and an HMAC equivalent hasn't been defined yet for Keccak.  I imagine this will
be done with great care and might not be finalized for a year or more.  (Note
that using standard HMAC with Keccak is not advised, although I don't
understand why exactly this is.)

So if we were to switch to Keccak today, we'd either have to venture into
dangerous waters inventing our own tying mechanism (something I'm now wise
enough to avoid), or abandon the cryptographic tying features in the protocol.


Thoughts?
=========

So, what do people think?  Anyone care to make an argument for Keccak, or for
truncated sha512, or even something else?

I think all the SHA-3 finalists are conservatively designed, highly secure
hashes, with different strengths and weaknesses when it comes to specific
applications.  For me, Skein stood out as the most pragmatic from an applied
cryptography perspective, which I why I choose it.

One last note: we're considering using a 280-bit[8] digest instead of 240-bit,
in which case we'd need to use the even slower sha3_384 were we to go the
Keccak route.


Links
=====

[1] Dmedia Hashing Protocol: http://docs.novacut.com/filestore/protocol2.html
[2] Skein: http://www.skein-hash.info/
[3] Keccak: http://keccak.noekeon.org/
[4] PySkein: http://packages.python.org/pyskein/
[5] PySha3: http://pypi.python.org/pypi/pysha3/
[6] http://bazaar.launchpad.net/~dmedia/dmedia/trunk/view/head:/hashbench.py
[7] https://bugs.launchpad.net/dmedia/+bug/1064674
[8] https://bugs.launchpad.net/filestore/+bug/1088404