openstack team mailing list archive

Re: Expanding Storage - Rebalance Extreeemely Slow (or Stalled?)

 

Hi Samuel,

Thanks for the quick reply.

The weights are all 100. Here's the output of swift-ring-builder:

root@proxy1:/etc/swift# swift-ring-builder account.builder
account.builder, build version 13
1048576 partitions, 3 replicas, 3 zones, 12 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices:    id  zone      ip address  port      name weight partitions balance meta
             0     1     192.168.1.3  6002    c0d1p1 100.00     262144    0.00
             1     1     192.168.1.3  6002    c0d2p1 100.00     262144    0.00
             2     1     192.168.1.3  6002    c0d3p1 100.00     262144    0.00
             3     2     192.168.1.4  6002    c0d1p1 100.00     262144    0.00
             4     2     192.168.1.4  6002    c0d2p1 100.00     262144    0.00
             5     2     192.168.1.4  6002    c0d3p1 100.00     262144    0.00
             6     3     192.168.1.5  6002    c0d1p1 100.00     262144    0.00
             7     3     192.168.1.5  6002    c0d2p1 100.00     262144    0.00
             8     3     192.168.1.5  6002    c0d3p1 100.00     262144    0.00
             9     1     192.168.1.3  6002    c0d4p1 100.00     262144    0.00
            10     2     192.168.1.4  6002    c0d4p1 100.00     262144    0.00
            11     3     192.168.1.5  6002    c0d4p1 100.00     262144    0.00
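
As a sanity check on the numbers above, here is a minimal Python sketch of the arithmetic behind the "partitions" column. It assumes each device's ideal share is simply proportional to its weight; the real ring builder also accounts for zones and min_part_hours, so treat this as an approximation. The function name expected_partitions and the weight of 1000 in the second case are made up for illustration, to show what Samuel's point about much larger weights (quoted below) would look like.

```python
def expected_partitions(partitions, replicas, weights):
    """Return each device's ideal partition-replica count, proportional to its weight."""
    total_slots = partitions * replicas      # every replica of every partition needs a home
    total_weight = sum(weights.values())
    return {dev: total_slots * w / total_weight for dev, w in weights.items()}


# Values from the account.builder output above: 12 devices, all with weight 100.
equal = {dev_id: 100.0 for dev_id in range(12)}
equal_share = expected_partitions(1048576, 3, equal)
print(equal_share[0])  # 262144.0, matching the "partitions" column

# Hypothetical case (NOT our ring): devices 0-8 at weight 1000, new devices 9-11 at 100.
skewed = {dev_id: (1000.0 if dev_id < 9 else 100.0) for dev_id in range(12)}
skewed_share = expected_partitions(1048576, 3, skewed)
print(round(skewed_share[0]), round(skewed_share[9]))  # old device vs. new device
```

With equal weights every device gets the same share, which is what the builder reports with 0.00 balance. In the hypothetical skewed case each new device would be assigned roughly a tenth of what an old device holds.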

On Mon, Oct 22, 2012 at 12:03 PM, Samuel Merritt <sam@xxxxxxxxxxxxxx> wrote:
> On 10/22/12 9:38 AM, Emre Sokullu wrote:
>>
>> Hi folks,
>>
>> At GROU.PS, we've been an OpenStack SWIFT user for more than 1.5 years
>> now. Currently, we hold about 18TB of data on 3 storage nodes. Since
>> we hit 84% in utilization, we have recently decided to expand the
>> storage with more disks.
>>
>> In order to do that, after creating a new c0d4p1 partition in each of
>> the storage nodes, we ran the following commands on our proxy server:
>>
>> swift-ring-builder account.builder add z1-192.168.1.3:6002/c0d4p1 100
>> swift-ring-builder container.builder add z1-192.168.1.3:6002/c0d4p1 100
>> swift-ring-builder object.builder add z1-192.168.1.3:6002/c0d4p1 100
>> swift-ring-builder account.builder add z2-192.168.1.4:6002/c0d4p1 100
>> swift-ring-builder container.builder add z2-192.168.1.4:6002/c0d4p1 100
>> swift-ring-builder object.builder add z2-192.168.1.4:6002/c0d4p1 100
>> swift-ring-builder account.builder add z3-192.168.1.5:6002/c0d4p1 100
>> swift-ring-builder container.builder add z3-192.168.1.5:6002/c0d4p1 100
>> swift-ring-builder object.builder add z3-192.168.1.5:6002/c0d4p1 100
>>
>> [snip]
>
>>
>> So right now, the problem is that the disk growth in each of the storage
>> nodes seems to have stalled,
>
> So you've added 3 new devices to each ring and assigned a weight of 100 to
> each one. What are the weights of the other devices in the ring? If they're
> much larger than 100, then that will cause the new devices to end up with a
> small fraction of the data you want on them.
>
> Running "swift-ring-builder <thing>.builder" will show you information,
> including weights, of all the devices in the ring.
>
>
>
>> * Bonus question: why do we copy ring.gz files to storage nodes, and
>> how critical are they? To me it's not clear how Swift can afford to
>> wait (even though it's just a few seconds) for the .ring.gz files to
>> reach the storage nodes after rebalancing, if those files are so critical.
>
>
> The ring.gz files contain the mapping from Swift partitions to disks. As you
> know, the proxy server uses it to determine which backends have the data for
> a given request. The replicators also use the ring to determine where data
> belongs so that they can ensure the right number of replicas, etc.
>
> When two storage nodes have different versions of a ring.gz file, you can
> get replicator fights. They look like this:
>
> - node1's (old) ring says that the partition for a replica of /cof/fee/cup
> belongs on node2's /dev/sdf.
> - node2's (new) ring says that the same partition belongs on node1's
> /dev/sdd.
>
> When the replicator on node1 runs, it will see that it has the partition for
> /cof/fee/cup on its disk. It will then consult the ring, push that
> partition's contents to node2, and then delete its local copy (since node1's
> ring says that this data does not belong on node1).
>
> When the replicator on node2 runs, it will do the converse: push to node1,
> then delete its local copy.
>
> If you leave the rings out of sync for a long time, then you'll end up
> consuming disk and network IO ping-ponging a set of data around. If they're
> out of sync for a few seconds, then it's not a big deal.
>
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp
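
To make sure I follow the "replicator fight" described in the quoted reply, here is a toy Python simulation of it. This is only a sketch under strong assumptions: each node's ring is modelled as a plain dict from partition to owning node, and replicator_pass is a hypothetical stand-in for a replicator run, not Swift's actual replicator code. Partition number 42 is arbitrary.

```python
def replicator_pass(node, ring, disks):
    """One simplified replicator pass on `node`, using that node's copy of the ring.

    Any partition held locally that the ring assigns elsewhere is pushed to
    its owner and then deleted locally. Returns the list of transfers made.
    """
    moved = []
    for part in sorted(disks[node]):         # sorted() copies, so we can mutate the set
        owner = ring[part]
        if owner != node:
            disks[owner].add(part)           # push the partition's contents
            disks[node].discard(part)        # then delete the local copy
            moved.append((part, node, owner))
    return moved


PART = 42                                     # arbitrary partition number
old_ring = {PART: "node2"}                    # stale ring still on node1
new_ring = {PART: "node1"}                    # rebalanced ring already on node2
disks = {"node1": {PART}, "node2": set()}

# Rings out of sync: every round moves the partition there and back again.
transfers_out_of_sync = 0
for _ in range(4):
    transfers_out_of_sync += len(replicator_pass("node1", old_ring, disks))
    transfers_out_of_sync += len(replicator_pass("node2", new_ring, disks))
print(transfers_out_of_sync)  # 8

# Rings in sync: both nodes agree, so nothing moves.
transfers_in_sync = 0
for _ in range(4):
    transfers_in_sync += len(replicator_pass("node1", new_ring, disks))
    transfers_in_sync += len(replicator_pass("node2", new_ring, disks))
print(transfers_in_sync)  # 0
```

In this toy model the wasted transfers grow with the number of replicator passes made while the rings disagree, which is consistent with the point that a few seconds of skew is harmless while a long-lived mismatch burns disk and network IO.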

