Re: ask for comments - Light weight Erasure code framework for swift
On 10/15/12 5:36 PM, Duan, Jiangang wrote:
Some of our customers are more interested in erasure coding than in tri-replication, as a way to save disk space.
We propose a BP "Light weight Erasure code framework for swift", which can be found here https://blueprints.launchpad.net/swift/+spec/swift-ec
The general idea is to have a daemon on each storage node do an offline scan and select objects large enough to be worth erasure coding.
We will be glad to hear any feedback on this.
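To put rough numbers on the disk-space argument first, assuming a 10+2 coding (my assumption; the blueprint doesn't pin the parameters down):

    # Back-of-envelope storage overhead; the 10+2 parameters are an
    # assumption, not something the blueprint specifies.
    def raw_bytes_per_user_byte(data_blocks, parity_blocks):
        return (data_blocks + parity_blocks) / data_blocks

    print(raw_bytes_per_user_byte(10, 2))   # 1.2, versus 3.0 for tri-replication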
Here, in no particular order, are some thoughts I have.
- Object blocks (both data blocks and parity blocks) will need to be
marked somehow so that 3 replicas of each block aren't kept. This is a
pretty fundamental change to Swift; up until now, all objects have been
treated the same. It's essentially introducing the notion of tiered
storage into Swift.
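Purely as an illustration of what "marking" might involve (the metadata
keys below are made up, not anything Swift stores today), the replicator
would have to branch on some per-object flag instead of treating every
object identically:

    # Hypothetical sketch only -- these metadata keys are invented for
    # illustration and are not what Swift stores on disk.
    fragment_metadata = {
        'name': '/AUTH_test/container/object',
        'X-Object-Storage-Scheme': 'ec-10-2',    # vs. 'replicated'
        'X-Object-Fragment-Index': 7,            # which of the 12 blocks this is
    }

    def copies_wanted(metadata):
        # An EC block should exist exactly once; a plain object three times.
        if metadata.get('X-Object-Storage-Scheme', 'replicated') == 'replicated':
            return 3
        return 1

    print(copies_wanted(fragment_metadata))   # 1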
- Who's responsible for ensuring the presence of all the blocks? That
is, assume you have an object that's been split into ten data blocks
(D1, D2, ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on it
dies. Which replicator(s) is(are) responsible for rebuilding D7 and
storing it on a handoff node?
If you have the replicators on each block's machine checking for
failures, then you'll wind up with more replicators checking each piece
of the object. Here, it would be 11 replicators ensuring that each block
is present, compared to the full-replication case, where 2 replicators
check on each copy. That's going to result in more traffic on the
internal network.
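The fan-out is easy to see if you assume one replicator per node that
holds a piece of the object:

    # How many peer replicators end up watching each stored piece?
    def peers_checking_each_piece(pieces_per_object):
        # every node holding a piece also checks on all the others
        return pieces_per_object - 1

    print(peers_checking_each_piece(3))    # replication: 2 peers per copy
    print(peers_checking_each_piece(12))   # 10+2 EC: 11 peers per block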
- There will need to be throttles on the transformation daemons (replica
-> EC and vice versa), as that work is very I/O-intensive. If a big batch
of data is uploaded at one time and then not accessed (think large
backups), then that could be a ticking time bomb for my cluster's
performance. After those objects become "cold", the transformation
daemons will thrash my disks and network turning them into EC-type objects.
- Does this open up a Swift cluster to a DoS attack? If my objects are
stored w/EC, then can someone go through and request a few bytes from
each object in my cluster a few times and cause all my objects to get
"hot"? Under the proposed scheme, this would turn my objects from
EC-storage to replica-storage, filling up my disks and killing my
cluster. To mitigate that, I'd have to keep enough disk around to hold 3
replicas of everything, and at that point, I may as well just keep the 3
replicas.
- Another thought for a resource-consumption attack: can someone slowly
walk my objects and make a large fraction (say, 5%) of them hot each
day? That seems like it would make the transformation daemons run at
maximum capacity all the time trying to keep up.
- Retrieval of EC-stored objects becomes more failure-prone. With
replica-stored objects, 1 out of 3 object servers has to be available
for a GET request to work. With EC-stored objects and a 10:2 coding, 10
out of 12 object servers have to be available. That makes network
partitions much worse for data availability.
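Rough numbers, assuming each object server is independently reachable
with probability p (the 0.99 figure is just an example):

    from math import comb   # Python 3.8+

    def replica_get_availability(p, copies=3):
        # GET works if at least one of the copies is reachable
        return 1 - (1 - p) ** copies

    def ec_get_availability(p, data=10, parity=2):
        # GET works if at least `data` of the data+parity servers are reachable
        n = data + parity
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(data, n + 1))

    p = 0.99
    print(replica_get_availability(p))   # ~0.999999
    print(ec_get_availability(p))        # ~0.9998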
- EC-storage is at odds with geographic replication. Of course, Swift
supports neither one today. However, with geographic replication, one
wants to have a local replica of each object in each geographic
region, which results in more copies for lower latency. With EC-storage,
less data is stored. When they're combined, the result is a whole lot of
traffic across slow, expensive WAN links.
- Recombining EC-stored object chunks is going to chew up a ton more CPU
on either the object or proxy servers, depending on which one does it.
If the proxy, then it'll add more to an already CPU-heavy workload. If
the object server, then it'll make using big storage boxes less
practical (like one of the 48-drives-in-4U servers one can buy).
- Can one change the EC-coding level? That is, if I'm using 10:2 coding
(so each object turns into 10 data blocks and 2 parity blocks), can I
change that later? Will that have massive performance impacts on my
cluster as more data blocks are computed?
It may be that this is like changing the replica count, and the answer
is "yes, but your cluster will thrash for a long time after you do it".
- Where's the original checksum stored? Clearly, each block will have
its own checksum for the auditors to use. However, if a client issues a
request like "HEAD /a/c/o", the response has to contain the checksum
(ETag) of the original object. Does that live somewhere, or will the
proxy have to read all the bytes and recompute the checksum?
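One illustrative possibility (nothing in the blueprint specifies this,
and the metadata key below is made up) is to compute the whole-object
checksum once at encode time and store it alongside every block, so a
HEAD never has to reassemble the object:

    import hashlib

    def encode_with_original_etag(obj_bytes, encode_fn):
        # encode_fn stands in for whatever EC library splits the object
        # into its 10 data blocks + 2 parity blocks.
        etag = hashlib.md5(obj_bytes).hexdigest()
        return [{'body': block,
                 'X-Object-Meta-Orig-Etag': etag}   # hypothetical key
                for block in encode_fn(obj_bytes)]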
- I wonder what effect this will have on internal-network traffic. With
a replica-stored object, the proxy opens one connection to an object
server, sends a request, gets a response, and streams the bytes out to
the client.
With an EC-stored object, the proxy has to open connections to, say, 10
different object servers. Further, if one of the data blocks is
unavailable (say data block 5), then the proxy has to go ahead and
re-request all the data blocks plus a parity block so that it can fill
in the gaps. That may be a significant increase in traffic on Swift's
internal network. Further, by using such a large number of connections,
it considerably increases the probability of a connection failure, which
would mean more client requests would fail with truncated downloads.
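To make the connection-count point concrete, assume each
proxy-to-object-server connection independently drops mid-stream with
some small probability q (the figure below is purely illustrative):

    # A GET is truncated if any one of the streaming connections drops.
    def truncated_get_probability(q, connections):
        return 1 - (1 - q) ** connections

    print(truncated_get_probability(0.001, 1))    # replication: ~0.001
    print(truncated_get_probability(0.001, 10))   # 10-block EC read: ~0.01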
Those are all the thoughts I have right now that are coherent enough to
put into text. Clearly, adding erasure coding (or any other form of
tiered storage) to Swift is not something undertaken lightly.
Hope this helps.