
openstack team mailing list archive

Re: ask for comments - Light weight Erasure code framework for swift

 

Sam, 

Your comments are pretty reasonable. 
Let me think about this and get back to you with comments later.
Thanks.

-jiangang

On Wed, Oct 17, 2012 at 4:24 PM, Samuel Merritt <sam@xxxxxxxxxxxxxx> wrote:
> On 10/15/12 5:36 PM, Duan, Jiangang wrote:
>>
>> Some of our customers are interested in erasure coding rather than 
>> triple replication to save disk space.
>> We propose a BP "Light weight Erasure code framework for swift", 
>> which can be found here 
>> https://blueprints.launchpad.net/swift/+spec/swift-ec
>> The general idea is to have a daemon on each storage node do an 
>> offline scan and select cold objects that are large enough to be 
>> erasure-coded (EC).
>>
>> We will be glad to hear any feedback on this.
>
>
> Here, in no particular order, are some thoughts I have.
>
> - Object blocks (both data blocks and parity blocks) will need to be 
> marked somehow so that 3 replicas of each block aren't kept. This is a 
> pretty fundamental change to Swift; up until now, all objects are treated the same.
> It's essentially introducing the notion of tiered storage into Swift.
>
> - Who's responsible for ensuring the presence of all the blocks? That 
> is, assume you have an object that's been split into ten data blocks 
> (D1, D2, ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on 
> it dies. Which replicator(s) is (are) responsible for rebuilding D7 and 
> storing it on a handoff node?
>
> If you have the replicators on each block's machine checking for 
> failures, then you'll wind up with more replicators checking each block. 
> Here, it would be 11 replicators ensuring that each block is present. 
> Compare that to the full-replication case, where there are 2 
> replicators checking on each replica. That's going to result in more 
> traffic on the internal network.
>
> - There will need to be throttles on the transformation daemons 
> (replica -> EC and vice versa), as that's very IO intensive. If a big 
> bunch of data is uploaded at one time and then not accessed (think 
> large backups), then that could be a ticking time bomb for my cluster 
> performance. After those objects become "cold", the transformation 
> daemons will thrash my disks and network turning them into EC-type objects.
>
> - Does this open up a Swift cluster to a DoS attack? If my objects are 
> stored w/EC, then can someone go through and request a few bytes from 
> each object in my cluster a few times and cause all my objects to get "hot"?
> Under the proposed scheme, this would turn my objects from EC-storage 
> to replica-storage, filling up my disks and killing my cluster. To 
> mitigate that, I'd have to keep enough disk around to hold 3 replicas 
> of everything, and at that point, I may as well just keep the 3 replicas.
>
> - Another thought for a resource-consumption attack: can someone 
> slowly walk my objects and make a large fraction (say, 5%) of them hot 
> each day? That seems like it would make the transformation daemons run 
> at maximum capacity all the time trying to keep up.
>
> - Retrieval of EC-stored objects becomes more failure-prone. With 
> replica-stored objects, 1 out of 3 object servers has to be available 
> for a GET request to work. With EC-stored objects and a 10:2 coding, 
> 10 out of 12 object servers have to be available. That makes network 
> partitions much worse for data availability.
>
> - EC-storage is at odds with geographic replication. Of course, Swift 
> supports neither one today. However, with geographic replication, one 
> wants to have a local replica of each object in each geographic 
> region, which results in more copies for lower latency. With 
> EC-storage, less data is stored. When they're combined, the result is 
> a whole lot of traffic across slow, expensive WAN links.
>
> - Recombining EC-stored object chunks is going to chew up a ton more 
> CPU on either the object or proxy servers, depending on which one does 
> it. If the proxy, then it'll add more to an already CPU-heavy 
> workload. If the object server, then it'll make using big storage 
> boxes less practical (like one of the 48-drives-in-4U servers one can buy).
>
> - Can one change the EC-coding level? That is, if I'm using 10:2 
> coding (so each object turns into 10 data blocks and 2 parity blocks), 
> can I change that later? Will that have massive performance impacts on 
> my cluster as more data blocks are computed?
>
> It may be that this is like changing the replica count, and the answer 
> is "yes, but your cluster will thrash for a long time after you do it".
>
> - Where's the original checksum stored? Clearly, each block will have 
> its own checksum for the auditors to use. However, if a client issues 
> a request like "HEAD /a/c/o", that'll contain the checksum of the 
> original file. Does that live somewhere, or will the proxy have to 
> read all the bytes and determine the checksum?
>
> - I wonder what effect this will have on internal-network traffic. 
> With a replica-stored object, the proxy opens one connection to an 
> object server, sends a request, gets a response, and streams the bytes out to the client.
>
> With an EC-stored object, the proxy has to open connections to, say, 
> 10 different object servers. Further, if one of the data blocks is 
> unavailable (say data block 5), then the proxy has to go ahead and 
> re-request all the data blocks plus a parity block so that it can fill 
> in the gaps. That may be a significant increase in traffic on Swift's 
> internal network. Further, by using such a large number of 
> connections, it considerably increases the probability of a connection 
> failure, which would mean more client requests would fail with truncated downloads.
>
>
> Those are all the thoughts I have right now that are coherent enough 
> to put into text. Clearly, adding erasure coding (or any other form of 
> tiered storage) to Swift is not something undertaken lightly.
>
> Hope this helps.
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp
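
The trade-offs raised in the quoted message (disk space saved, GET
availability, and the number of internal connections per request) can be
put into rough numbers. The following is only a back-of-the-envelope
sketch, not part of the blueprint or of Swift. It assumes that servers and
connections fail independently, and the 0.99 availability and 0.001
connection-failure figures are made-up illustrative values. The function
names are hypothetical.

```python
from math import comb


def storage_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data for a data:parity coding."""
    return (data_blocks + parity_blocks) / data_blocks


def availability(total, needed, p_up):
    """Probability that at least `needed` of `total` servers are reachable.

    Assumes each server is up independently with probability `p_up`
    (an illustrative simplification, not a measured Swift figure).
    """
    return sum(
        comb(total, k) * p_up**k * (1 - p_up) ** (total - k)
        for k in range(needed, total + 1)
    )


def any_connection_fails(connections, p_fail):
    """Probability that at least one of `connections` connections fails,
    assuming independent failures with probability `p_fail` each."""
    return 1 - (1 - p_fail) ** connections


if __name__ == "__main__":
    p_up = 0.99      # assumed per-server availability (illustrative)
    p_fail = 0.001   # assumed per-connection failure rate (illustrative)

    # Disk space: 3 full replicas versus 10:2 coding.
    print("replica overhead:", 3.0)
    print("10:2 EC overhead:", storage_overhead(10, 2))

    # GET availability: 1 of 3 replicas versus 10 of 12 blocks.
    print("replica GET availability:", availability(3, 1, p_up))
    print("10:2 EC GET availability:", availability(12, 10, p_up))

    # Connection count: 1 object server versus 10 data-block servers.
    print("1 connection, any failure:", any_connection_fails(1, p_fail))
    print("10 connections, any failure:", any_connection_fails(10, p_fail))
```

Under these assumptions, 10:2 coding stores 1.2 bytes per byte of data
instead of 3, but a GET is less likely to succeed and more likely to hit
a failed connection than with 3 replicas. Real clusters have correlated
failures (racks, network partitions), so the independent model is
optimistic for both schemes.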



--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
We're hiring! http://tinyurl.com/mirantis-openstack-engineer


