swift team mailing list archive

Thread
Date

Re: Deploying and maintaining rings

To: Andrew Shafer <andrew@xxxxxxxxxxxxxxxx>
From: Gregory Holt <gholt@xxxxxxxxxxxxx>
Date: Sun, 5 Sep 2010 19:35:01 -0500
Cc: swift@xxxxxxxxxxxxxxxxxxx
In-reply-to: <AANLkTikB6kFnk2z8xMnbJoMBtA5d3Mr1wpC_3a=ECWzc@mail.gmail.com>

On Sep 5, 2010, at 2:43 PM, Andrew Shafer wrote:

> One more clarification, when you rebuild the rings, how do you manage the timing of pushing out the new rings? Do you push them out incrementally?
> 
> And how does the cluster behave when part of it is still on the old/different ring?

This'll be a bit of a long response, but there are a lot of details.

It's best to push out each ring just after running the 'rebalance' stage, as that's the stage that actually reassigns partitions to devices. You can do as many 'add', 'remove', and 'set_weight' operations as you want before the rebalance. The ring builder will only allow one replica of each partition to be reassigned within the 'min_part_hours' specified in the builder (set on ring creation, changeable after).

The key there is that if only one replica of a partition is reassigned, then if a node has the previous ring it will still have two of the three correct nodes (assuming three replicas here).

Assuming it's a storage node with the previous ring, it would just mean it'd send copies of it's data for that partition to an 'incorrect' node (which will accept it and move it to where it belongs later). Object and container nodes also send updates to container and account nodes, so 1/3 of those would not succeed (but would be persisted to be retried later).

If it's a proxy with the previous ring, reads would have a slightly higher failure rate for that partition as it would only have two chances to read instead of three. Writes would send one of the replicas off to the wrong node, but that would get replicated to where it belonged later.

This is also where consistency issues can arise, imagine a ring just changed but one proxy has a previous version. The rest of the system is sending writes to the correct nodes and is working on moving the replica data to the correct nodes. That proxy could read older data from the to-be-moved replica after another proxy had written a new version of that data to the new, correct nodes. This would happen just on that proxy for a about 1/3 of the reads for that partition assuming another proxy wrote new data that read should see, assuming the older node had any associated data as a not-found read would move on to the next, correct replica.

So the ring should be pushed out in a "timely" manner mainly for consistency reasons. Timely depending on how lenient you are about consistency windows. If you were very stringent, you could push the rings out to a temp location in one pass and then move them all into place in a second pass.

Back to the min_part_hours of the ring builder. This should be set depending on how long it is taking your cluster to recover from a ring change (you can monitor this with swift-stats-report). Setting it to an hour to start with is fine. Each time you push out a new ring you should monitor how long the cluster takes to recover and adjust this min_part_hours accordingly. Say it takes 1.5 hours as the cluster gets more data, you should bump the min_part_hours to 2 or even 3 to be safer.

Since the min_part_hours prevents moving more than one replica of a partition per that time frame, a perfectly balanced ring can't always be achieved with the one rebalance. The ring builder will notice if the balance is off by more than 5% and indicate you should push the ring it just made, wait min_part_hours, and then rebalance and repush.

A scenario where this can occur is say, doubling the size of the cluster. It will only move 1/3 of the replica data to the new nodes at first, wait min_part_hours, then move the remaining 1/6 of the replica data to get even balance again.

In the case where, say, it's taking 6 hours to recover but min_part_hours is set to 1 and you push out new rings every hour for 5 hours (just making up a pretty bad scenario), no data will be lost, it just might be unreachable until the cluster finally does recover, moving all the replicas to the nodes they belong. Unreachable just means that reads could fail on all three replicas of the data because all three replicas are floating around on older nodes and haven't yet reached their proper destinations.

Sorry for the long response, but there are even more details than this (failed nodes during ring pushes, etc.)

References

Deploying and maintaining rings
From: Andrew Shafer, 2010-09-05
Re: Deploying and maintaining rings
From: Gregory Holt, 2010-09-05
Re: Deploying and maintaining rings
From: Andrew Shafer, 2010-09-05