
Re: ask for comments - Light weight Erasure code framework for swift

 

On 10/15/12 5:36 PM, Duan, Jiangang wrote:
Some of our customers are more interested in erasure coding than triple replication, as a way to save disk space.
We propose a BP "Light weight Erasure code framework for swift", which can be found here: https://blueprints.launchpad.net/swift/+spec/swift-ec
The general idea is to have a daemon on each storage node do an offline scan and select objects large enough to be worth erasure coding.

We would be glad to hear any feedback on this.

Here, in no particular order, are some thoughts I have.

- Object blocks (both data blocks and parity blocks) will need to be marked somehow so that 3 replicas of each block aren't kept. This is a pretty fundamental change to Swift; up until now, all objects have been treated the same. It's essentially introducing the notion of tiered storage into Swift.
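
To make that concrete, fragments would need something like the following per-fragment metadata so replicators and auditors can tell them apart from ordinary replicated objects. The key names here are purely hypothetical, invented for illustration:

    # Hypothetical metadata attached to each stored fragment. None of these
    # keys exist in Swift today; they only show what a replicator/auditor
    # would have to know about.
    FRAGMENT_META = {
        'X-Object-Meta-Ec-Scheme': '10+2',       # data:parity layout
        'X-Object-Meta-Ec-Frag-Index': '7',      # which fragment this is (D7)
        'X-Object-Meta-Ec-Copies': '1',          # keep one copy, not three
    }

    def is_ec_fragment(metadata):
        # Replicator-side check: EC fragments get different treatment
        # (single copy, reconstruct-on-loss) than full replicas.
        return 'X-Object-Meta-Ec-Frag-Index' in metadata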

- Who's responsible for ensuring the presence of all the blocks? That is, assume you have an object that's been split into ten data blocks (D1, D2, ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on it dies. Which replicator(s) is(are) responsible for rebuilding D7 and storing it on a handoff node?

If you have the replicators on each block's machine checking for failures, then you wind up with more replicators checking each block. Here, it would be 11 replicators ensuring that each block is present, compared to the full-replication case, where only 2 replicators check each replica. That's going to result in more traffic on the internal network.
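
One way out would be a deterministic rule so that only one node takes responsibility for a given rebuild. A rough sketch (the helper and the rule are made up, not part of the blueprint):

    def should_rebuild(my_frag_index, surviving_frag_indexes, missing_frag_index):
        # Hypothetical rule: the node holding the lowest-numbered surviving
        # fragment rebuilds the missing one, so a dozen replicators don't all
        # race to reconstruct the same block.
        if missing_frag_index in surviving_frag_indexes:
            return False            # nothing is missing after all
        return my_frag_index == min(surviving_frag_indexes)

    # D7 is lost: the node holding D1 takes the job, the node holding D3 doesn't.
    survivors = {1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12}
    assert should_rebuild(1, survivors, 7)
    assert not should_rebuild(3, survivors, 7)

Even with a rule like that, every fragment holder still has to notice the failure, so the extra checking traffic doesn't go away.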

- There will need to be throttles on the transformation daemons (replica -> EC and vice versa), as that conversion is very IO-intensive. If a large batch of data is uploaded at once and then not accessed (think large backups), that's a ticking time bomb for cluster performance: once those objects become "cold", the transformation daemons will thrash my disks and network turning them into EC-type objects.
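
At minimum I'd expect a byte-rate cap on the conversion work, something along these lines (class name and numbers are arbitrary, just a sketch):

    import time

    class ByteRateThrottle(object):
        """Minimal token-bucket throttle a transformation daemon could call
        before each replica -> EC conversion (and the reverse)."""

        def __init__(self, bytes_per_second):
            self.rate = float(bytes_per_second)
            self.allowance = self.rate
            self.last = time.time()

        def wait_for(self, nbytes):
            # Refill the bucket based on elapsed time, capped at one second's
            # worth of budget, then either spend it or sleep off the deficit.
            now = time.time()
            self.allowance = min(self.rate,
                                 self.allowance + (now - self.last) * self.rate)
            self.last = now
            if self.allowance >= nbytes:
                self.allowance -= nbytes
                return
            time.sleep((nbytes - self.allowance) / self.rate)
            self.last = time.time()
            self.allowance = 0.0

    throttle = ByteRateThrottle(50 * 1024 * 1024)   # e.g. cap at 50 MiB/s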

- Does this open up a Swift cluster to a DoS attack? If my objects are stored w/EC, then can someone go through and request a few bytes from each object in my cluster a few times and cause all my objects to get "hot"? Under the proposed scheme, this would turn my objects from EC-storage to replica-storage, filling up my disks and killing my cluster. To mitigate that, I'd have to keep enough disk around to hold 3 replicas of everything, and at that point, I may as well just keep the 3 replicas.
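
The capacity arithmetic is the killer here (using the same 10:2 coding and 3 replicas as above):

    data_blocks, parity_blocks, replica_count = 10, 2, 3

    ec_overhead = (data_blocks + parity_blocks) / float(data_blocks)
    replica_overhead = float(replica_count)

    print("EC overhead:      %.1fx" % ec_overhead)        # 1.2x
    print("replica overhead: %.1fx" % replica_overhead)   # 3.0x

    # If a hostile (or merely unlucky) access pattern can force objects back
    # into replica form, the cluster has to be provisioned for the 3.0x worst
    # case anyway, and the 1.2x figure buys nothing.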

- Another thought for a resource-consumption attack: can someone slowly walk my objects and make a large fraction (say, 5%) of them hot each day? That seems like it would make the transformation daemons run at maximum capacity all the time trying to keep up.

- Retrieval of EC-stored objects becomes more failure-prone. With replica-stored objects, 1 out of 3 object servers has to be available for a GET request to work. With EC-stored objects and a 10:2 coding, 10 out of 12 object servers have to be available. That makes network partitions much worse for data availability.
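
A quick availability calculation shows the gap (the 99% per-server availability is just an example figure):

    from math import factorial

    def comb(n, k):
        return factorial(n) // (factorial(k) * factorial(n - k))

    def replica_get_availability(p, copies=3):
        # GET works if at least one of the replica holders is reachable.
        return 1.0 - (1.0 - p) ** copies

    def ec_get_availability(p, data=10, parity=2):
        # GET works only if at least `data` of the data+parity holders are reachable.
        n = data + parity
        return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
                   for k in range(data, n + 1))

    print(replica_get_availability(0.99))   # ~0.999999
    print(ec_get_availability(0.99))        # ~0.9998 -- roughly 200x the downtime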

- EC-storage is at odds with geographic replication. Of course, Swift supports neither one today. However, with geographic replication, one wants a local replica of each object in each geographic region, which means more copies for lower latency. With EC-storage, less data is stored. Combine the two, and the result is a whole lot of traffic across slow, expensive WAN links.

- Recombining EC-stored object chunks is going to chew up a ton more CPU on either the object or proxy servers, depending on which one does it. If the proxy, then it'll add more to an already CPU-heavy workload. If the object server, then it'll make using big storage boxes less practical (like one of the 48-drives-in-4U servers one can buy).

- Can one change the EC-coding level? That is, if I'm using 10:2 coding (so each object turns into 10 data blocks and 2 parity blocks), can I change that later? Will that have massive performance impacts on my cluster as more data blocks are computed?

It may be that this is like changing the replica count, and the answer is "yes, but your cluster will thrash for a long time after you do it".

- Where's the original checksum stored? Clearly, each block will have its own checksum for the auditors to use. However, if a client issues a request like "HEAD /a/c/o", the response has to contain the checksum of the original object. Does that live somewhere, or will the proxy have to read all the bytes and recompute it?
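
In other words, the whole-object MD5 has to be computed when the object is split and carried around somewhere. A sketch of one option, with made-up metadata names (and parity generation omitted):

    import hashlib

    def split_with_etag(data, n_data=10):
        # The original ETag can't be recovered from per-fragment checksums,
        # so compute it before splitting.
        etag = hashlib.md5(data).hexdigest()
        chunk = max(1, (len(data) + n_data - 1) // n_data)
        fragments = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        # One option: stash the original ETag on every fragment so any
        # surviving fragment can answer a HEAD without reassembly.
        frag_meta = [{'frag-index': i,
                      'orig-etag': etag,
                      'frag-md5': hashlib.md5(frag).hexdigest()}
                     for i, frag in enumerate(fragments)]
        return fragments, frag_meta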

- I wonder what effect this will have on internal-network traffic. With a replica-stored object, the proxy opens one connection to an object server, sends a request, gets a response, and streams the bytes out to the client.

With an EC-stored object, the proxy has to open connections to, say, 10 different object servers. Further, if one of the data blocks is unavailable (say data block 5), then the proxy has to go ahead and re-request all the data blocks plus a parity block so that it can fill in the gaps. That may be a significant increase in traffic on Swift's internal network. Further, by using such a large number of connections, it considerably increases the probability of a connection failure, which would mean more client requests would fail with truncated downloads.
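
Rough numbers on the connection-failure point (the 0.1% per-connection failure rate is made up):

    def any_connection_fails(per_conn_failure, n_connections):
        # Probability that at least one of the parallel backend reads breaks.
        return 1.0 - (1.0 - per_conn_failure) ** n_connections

    print(any_connection_fails(0.001, 1))    # 0.001    -- replicated GET, one stream
    print(any_connection_fails(0.001, 10))   # ~0.00996 -- EC GET, ten streams

So a 10-way EC read is roughly ten times as likely to hit a broken connection as a single replicated stream.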


Those are all the thoughts I have right now that are coherent enough to put into text. Clearly, adding erasure coding (or any other form of tiered storage) to Swift is not something to be undertaken lightly.

Hope this helps.

