Re: Expanding Storage - Rebalance Extreeemely Slow (or Stalled?)
On Tue, Oct 23, 2012 at 12:16 PM, Emre Sokullu <emre@xxxxxxxxxxxxxx> wrote:
> Folks,
>
> This is the 3rd day, and I see no change, or only a very small one (KBs),
> on the new disks.
>
> Could this be normal? Is there a long computation process that has to run
> before the newly added disks actually start filling up?
>
> Or should I just start from scratch with the "create" command this time?
> The last time I did it, I didn't use the "swift-ring-builder create 20 3 1 .."
> command first but just started with "swift-ring-builder add ..." and used
> the existing ring.gz files, thinking that otherwise I might end up
> reformatting the whole stack. I'm not sure if that's the case.
>
That is correct - you don't want to recreate the rings, since that is
likely to cause redundant partition movement.
> Please advise. Thanks,
>
I think your expectations might be misplaced. The ring builder tries
not to move partitions needlessly. In your cluster, you had 3 zones
(and I'm assuming 3 replicas). Swift placed the partitions as
efficiently as it could, spread across the 3 zones (servers). As
things stand, there's no real reason for partitions to move across the
servers. I'm guessing that the data growth you've seen is from new
data, not from existing data movement (but there are some calls to
random in the code which might have produced some partition movement).
If you truly want to move things around forcefully, you could:
* decrease the weight of the old devices. This would leave them
over-weighted relative to their new target, and partitions would be
reassigned away from them.
* delete and re-add devices to the ring. This will cause all the
partitions from the deleted devices to be spread across the new set of
devices.
After you perform your ring manipulation commands, execute the
rebalance command and copy the resulting ring files out to every node
(a rough sketch follows below).
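As a rough sketch only (the device id, the new weight of 50, and the
/etc/swift paths are illustrative; d0 refers to device id 0 in your
account.builder listing, so check each builder's own listing for the
right id, and repeat for account.builder and container.builder):

    # lower the weight of one of the original devices so partitions
    # drift off it at the next rebalance
    swift-ring-builder object.builder set_weight d0 50
    # reassign partitions and write out the new object.ring.gz
    swift-ring-builder object.builder rebalance
    # push the new ring to every storage node
    for h in 192.168.1.3 192.168.1.4 192.168.1.5; do
        scp /etc/swift/object.ring.gz $h:/etc/swift/
    done

Also note that your builders show min_part_hours of 1, so a second
rebalance within the same hour won't move much.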
This is likely to cause *lots* of activity in your cluster... which
seems to be the desired outcome. It's also likely to have a negative
impact on requests served by the proxy, so it's something you probably
want to be careful about.
If you leave things alone as they are, new data will be distributed on
the new devices, and as old data gets deleted usage will rebalance
over time.
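(If you just want to watch the new disks fill up, something as simple
as the following will do, assuming you can ssh to the storage nodes and
the new devices are mounted under the usual /srv/node path:

    for h in 192.168.1.3 192.168.1.4 192.168.1.5; do
        ssh $h df -h /srv/node/c0d4p1
    done
)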
> --
> Emre
>
> On Mon, Oct 22, 2012 at 12:09 PM, Emre Sokullu <emre@xxxxxxxxxxxxxx> wrote:
>>
>> Hi Samuel,
>>
>> Thanks for quick reply.
>>
>> They're all 100. And here's the output of swift-ring-builder
>>
>> root@proxy1:/etc/swift# swift-ring-builder account.builder
>> account.builder, build version 13
>> 1048576 partitions, 3 replicas, 3 zones, 12 devices, 0.00 balance
>> The minimum number of hours before a partition can be reassigned is 1
>> Devices: id zone ip address   port name   weight partitions balance meta
>>           0    1 192.168.1.3  6002 c0d1p1 100.00     262144    0.00
>>           1    1 192.168.1.3  6002 c0d2p1 100.00     262144    0.00
>>           2    1 192.168.1.3  6002 c0d3p1 100.00     262144    0.00
>>           3    2 192.168.1.4  6002 c0d1p1 100.00     262144    0.00
>>           4    2 192.168.1.4  6002 c0d2p1 100.00     262144    0.00
>>           5    2 192.168.1.4  6002 c0d3p1 100.00     262144    0.00
>>           6    3 192.168.1.5  6002 c0d1p1 100.00     262144    0.00
>>           7    3 192.168.1.5  6002 c0d2p1 100.00     262144    0.00
>>           8    3 192.168.1.5  6002 c0d3p1 100.00     262144    0.00
>>           9    1 192.168.1.3  6002 c0d4p1 100.00     262144    0.00
>>          10    2 192.168.1.4  6002 c0d4p1 100.00     262144    0.00
>>          11    3 192.168.1.5  6002 c0d4p1 100.00     262144    0.00
>>
>>
>> On Mon, Oct 22, 2012 at 12:03 PM, Samuel Merritt <sam@xxxxxxxxxxxxxx>
>> wrote:
>> > On 10/22/12 9:38 AM, Emre Sokullu wrote:
>> >>
>> >> Hi folks,
>> >>
>> >> At GROU.PS, we've been an OpenStack Swift user for more than 1.5 years
>> >> now. Currently, we hold about 18TB of data on 3 storage nodes. Since
>> >> we hit 84% utilization, we recently decided to expand the storage
>> >> with more disks.
>> >>
>> >> In order to do that, after creating a new c0d4p1 partition in each of
>> >> the storage nodes, we ran the following commands on our proxy server:
>> >>
>> >> swift-ring-builder account.builder add z1-192.168.1.3:6002/c0d4p1 100
>> >> swift-ring-builder container.builder add z1-192.168.1.3:6002/c0d4p1 100
>> >> swift-ring-builder object.builder add z1-192.168.1.3:6002/c0d4p1 100
>> >> swift-ring-builder account.builder add z2-192.168.1.4:6002/c0d4p1 100
>> >> swift-ring-builder container.builder add z2-192.168.1.4:6002/c0d4p1 100
>> >> swift-ring-builder object.builder add z2-192.168.1.4:6002/c0d4p1 100
>> >> swift-ring-builder account.builder add z3-192.168.1.5:6002/c0d4p1 100
>> >> swift-ring-builder container.builder add z3-192.168.1.5:6002/c0d4p1 100
>> >> swift-ring-builder object.builder add z3-192.168.1.5:6002/c0d4p1 100
>> >>
>> >> [snip]
>> >
>> >>
>> >> So right now, the problem is: the disk growth on each of the storage
>> >> nodes seems to have stalled,
>> >
>> > So you've added 3 new devices to each ring and assigned a weight of
>> > 100 to each one. What are the weights of the other devices in the
>> > ring? If they're much larger than 100, then that will cause the new
>> > devices to end up with a small fraction of the data you want on them.
>> >
>> > Running "swift-ring-builder <thing>.builder" will show you information,
>> > including weights, of all the devices in the ring.
>> >
>> >
>> >
>> >> * Bonus question: why do we copy ring.gz files to the storage nodes,
>> >> and how critical are they? To me it's not clear how Swift can afford
>> >> to wait (even though it's just a few seconds) for the .ring.gz files
>> >> to reach the storage nodes after rebalancing, if those files are so
>> >> critical.
>> >
>> >
>> > The ring.gz files contain the mapping from Swift partitions to disks.
>> > As you know, the proxy server uses it to determine which backends have
>> > the data for a given request. The replicators also use the ring to
>> > determine where data belongs so that they can ensure the right number
>> > of replicas, etc.
>> >
>> > When two storage nodes have different versions of a ring.gz file, you
>> > can get replicator fights. They look like this:
>> >
>> > - node1's (old) ring says that the partition for a replica of
>> >   /cof/fee/cup belongs on node2's /dev/sdf.
>> > - node2's (new) ring says that the same partition belongs on node1's
>> >   /dev/sdd.
>> >
>> > When the replicator on node1 runs, it will see that it has the
>> > partition for /cof/fee/cup on its disk. It will then consult the ring,
>> > push that partition's contents to node2, and then delete its local
>> > copy (since node1's ring says that this data does not belong on node1).
>> >
>> > When the replicator on node2 runs, it will do the converse: push to
>> > node1, then delete its local copy.
>> >
>> > If you leave the rings out of sync for a long time, then you'll end up
>> > consuming disk and network IO ping-ponging a set of data around. If
>> > they're out of sync for a few seconds, then it's not a big deal.
>> >
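>> > (One quick sanity check that every node is serving identical rings,
>> > assuming you can ssh to the storage nodes: compare checksums of the
>> > ring files, e.g.
>> >
>> >     for h in 192.168.1.3 192.168.1.4 192.168.1.5; do
>> >         ssh $h md5sum /etc/swift/*.ring.gz
>> >     done
>> >
>> > and check that the sums match what the proxy has.)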