
Re: ask for comments - Light weight Erasure code framework for swift

 

On 10/15/12 5:36 PM, Duan, Jiangang wrote:
Some of our customers are more interested in erasure coding than triple replication, as a way to save disk space.
We propose a BP "Light weight Erasure code framework for swift", which can be found here: https://blueprints.launchpad.net/swift/+spec/swift-ec
The general idea is to have a daemon on each storage node do an offline scan and select objects large enough to be worth erasure coding.

We would be glad to hear any feedback on this.

Here, in no particular order, are some thoughts I have.

- Object blocks (both data blocks and parity blocks) will need to be marked somehow so that 3 replicas of each block aren't kept. This is a pretty fundamental change to Swift; up until now, all objects have been treated the same. It's essentially introducing the notion of tiered storage into Swift.
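
To make that concrete, fragments would need something like the following per-fragment metadata so replicators and auditors can tell them apart from ordinary replicated objects. The key names here are purely hypothetical, invented for illustration:

    # Hypothetical metadata attached to each stored fragment. None of these
    # keys exist in Swift today; they only show what a replicator/auditor
    # would have to know about.
    FRAGMENT_META = {
        'X-Object-Meta-Ec-Scheme': '10+2',       # data:parity layout
        'X-Object-Meta-Ec-Frag-Index': '7',      # which fragment this is (D7)
        'X-Object-Meta-Ec-Copies': '1',          # keep one copy, not three
    }

    def is_ec_fragment(metadata):
        # Replicator-side check: EC fragments get different treatment
        # (single copy, reconstruct-on-loss) than full replicas.
        return 'X-Object-Meta-Ec-Frag-Index' in metadata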

- Who's responsible for ensuring the presence of all the blocks? That is, assume you have an object that's been split into ten data blocks (D1, D2, ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on it dies. Which replicator(s) is(are) responsible for rebuilding D7 and storing it on a handoff node?

If you have the replicators on each block's machine checking for failures, then you wind up with more replicators checking each block. Here, it would be 11 replicators ensuring that each block is present, compared to the full-replication case, where only 2 replicators check each replica. That's going to result in more traffic on the internal network.
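
One way out would be a deterministic rule so that only one node takes responsibility for a given rebuild. A rough sketch (the helper and the rule are made up, not part of the blueprint):

    def should_rebuild(my_frag_index, surviving_frag_indexes, missing_frag_index):
        # Hypothetical rule: the node holding the lowest-numbered surviving
        # fragment rebuilds the missing one, so a dozen replicators don't all
        # race to reconstruct the same block.
        if missing_frag_index in surviving_frag_indexes:
            return False            # nothing is missing after all
        return my_frag_index == min(surviving_frag_indexes)

    # D7 is lost: the node holding D1 takes the job, the node holding D3 doesn't.
    survivors = {1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12}
    assert should_rebuild(1, survivors, 7)
    assert not should_rebuild(3, survivors, 7)

Even with a rule like that, every fragment holder still has to notice the failure, so the extra checking traffic doesn't go away.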

- There will need to be throttles on the transformation daemons (replica -> EC and vice versa), as that conversion is very IO-intensive. If a large batch of data is uploaded at once and then not accessed (think large backups), that's a ticking time bomb for cluster performance: once those objects become "cold", the transformation daemons will thrash my disks and network turning them into EC-type objects.
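
At minimum I'd expect a byte-rate cap on the conversion work, something along these lines (class name and numbers are arbitrary, just a sketch):

    import time

    class ByteRateThrottle(object):
        """Minimal token-bucket throttle a transformation daemon could call
        before each replica -> EC conversion (and the reverse)."""

        def __init__(self, bytes_per_second):
            self.rate = float(bytes_per_second)
            self.allowance = self.rate
            self.last = time.time()

        def wait_for(self, nbytes):
            # Refill the bucket based on elapsed time, capped at one second's
            # worth of budget, then either spend it or sleep off the deficit.
            now = time.time()
            self.allowance = min(self.rate,
                                 self.allowance + (now - self.last) * self.rate)
            self.last = now
            if self.allowance >= nbytes:
                self.allowance -= nbytes
                return
            time.sleep((nbytes - self.allowance) / self.rate)
            self.last = time.time()
            self.allowance = 0.0

    throttle = ByteRateThrottle(50 * 1024 * 1024)   # e.g. cap at 50 MiB/s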

- Does this open up a Swift cluster to a DoS attack? If my objects are stored w/EC, then can someone go through and request a few bytes from each object in my cluster a few times and cause all my objects to get "hot"? Under the proposed scheme, this would turn my objects from EC-storage to replica-storage, filling up my disks and killing my cluster. To mitigate that, I'd have to keep enough disk around to hold 3 replicas of everything, and at that point, I may as well just keep the 3 replicas.
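
The capacity arithmetic is the killer here (using the same 10:2 coding and 3 replicas as above):

    data_blocks, parity_blocks, replica_count = 10, 2, 3

    ec_overhead = (data_blocks + parity_blocks) / float(data_blocks)
    replica_overhead = float(replica_count)

    print("EC overhead:      %.1fx" % ec_overhead)        # 1.2x
    print("replica overhead: %.1fx" % replica_overhead)   # 3.0x

    # If a hostile (or merely unlucky) access pattern can force objects back
    # into replica form, the cluster has to be provisioned for the 3.0x worst
    # case anyway, and the 1.2x figure buys nothing.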

- Another thought for a resource-consumption attack: can someone slowly walk my objects and make a large fraction (say, 5%) of them hot each day? That seems like it would make the transformation daemons run at maximum capacity all the time trying to keep up.

- Retrieval of EC-stored objects becomes more failure-prone. With replica-stored objects, 1 out of 3 object servers has to be available for a GET request to work. With EC-stored objects and a 10:2 coding, 10 out of 12 object servers have to be available. That makes network partitions much worse for data availability.
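
A quick availability calculation shows the gap (the 99% per-server availability is just an example figure):

    from math import factorial

    def comb(n, k):
        return factorial(n) // (factorial(k) * factorial(n - k))

    def replica_get_availability(p, copies=3):
        # GET works if at least one of the replica holders is reachable.
        return 1.0 - (1.0 - p) ** copies

    def ec_get_availability(p, data=10, parity=2):
        # GET works only if at least `data` of the data+parity holders are reachable.
        n = data + parity
        return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
                   for k in range(data, n + 1))

    print(replica_get_availability(0.99))   # ~0.999999
    print(ec_get_availability(0.99))        # ~0.9998 -- roughly 200x the downtime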

- EC-storage is at odds with geographic replication. Of course, Swift supports neither one today. However, with geographic replication, one wants a local replica of each object in each geographic region, which means more copies for lower latency. With EC-storage, less data is stored. Combine the two, and the result is a whole lot of traffic across slow, expensive WAN links.

- Recombining EC-stored object chunks is going to chew up a ton more CPU on either the object or proxy servers, depending on which one does it. If the proxy, then it'll add more to an already CPU-heavy workload. If the object server, then it'll make using big storage boxes less practical (like one of the 48-drives-in-4U servers one can buy).

- Can one change the EC-coding level? That is, if I'm using 10:2 coding (so each object turns into 10 data blocks and 2 parity blocks), can I change that later? Will that have massive performance impacts on my cluster as more data blocks are computed?

It may be that this is like changing the replica count, and the answer is "yes, but your cluster will thrash for a long time after you do it".

- Where's the original checksum stored? Clearly, each block will have its own checksum for the auditors to use. However, if a client issues a request like "HEAD /a/c/o", the response has to contain the checksum of the original object. Does that live somewhere, or will the proxy have to read all the bytes and recompute it?
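
In other words, the whole-object MD5 has to be computed when the object is split and carried around somewhere. A sketch of one option, with made-up metadata names (and parity generation omitted):

    import hashlib

    def split_with_etag(data, n_data=10):
        # The original ETag can't be recovered from per-fragment checksums,
        # so compute it before splitting.
        etag = hashlib.md5(data).hexdigest()
        chunk = max(1, (len(data) + n_data - 1) // n_data)
        fragments = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        # One option: stash the original ETag on every fragment so any
        # surviving fragment can answer a HEAD without reassembly.
        frag_meta = [{'frag-index': i,
                      'orig-etag': etag,
                      'frag-md5': hashlib.md5(frag).hexdigest()}
                     for i, frag in enumerate(fragments)]
        return fragments, frag_meta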

- I wonder what effect this will have on internal-network traffic. With a replica-stored object, the proxy opens one connection to an object server, sends a request, gets a response, and streams the bytes out to the client.

With an EC-stored object, the proxy has to open connections to, say, 10 different object servers. Further, if one of the data blocks is unavailable (say data block 5), then the proxy has to go ahead and re-request all the data blocks plus a parity block so that it can fill in the gaps. That may be a significant increase in traffic on Swift's internal network. Further, by using such a large number of connections, it considerably increases the probability of a connection failure, which would mean more client requests would fail with truncated downloads.
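
Rough numbers on the connection-failure point (the 0.1% per-connection failure rate is made up):

    def any_connection_fails(per_conn_failure, n_connections):
        # Probability that at least one of the parallel backend reads breaks.
        return 1.0 - (1.0 - per_conn_failure) ** n_connections

    print(any_connection_fails(0.001, 1))    # 0.001    -- replicated GET, one stream
    print(any_connection_fails(0.001, 10))   # ~0.00996 -- EC GET, ten streams

So a 10-way EC read is roughly ten times as likely to hit a broken connection as a single replicated stream.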


Those are all the thoughts I have right now that are coherent enough to put into text. Clearly, adding erasure coding (or any other form of tiered storage) to Swift is not something to be undertaken lightly.

Hope this helps.

