Re: Expanding Storage - Rebalance Extreeemely Slow (or Stalled?)
On 10/22/12 9:38 AM, Emre Sokullu wrote:
> Hi folks,
> At GROU.PS, we've been an OpenStack SWIFT user for more than 1.5 years
> now. Currently, we hold about 18TB of data on 3 storage nodes. Since
> we hit 84% in utilization, we have recently decided to expand the
> storage with more disks.
> In order to do that, after creating a new c0d4p1 partition in each of
> the storage nodes, we ran the following commands on our proxy server:
> swift-ring-builder account.builder add z1-192.168.1.3:6002/c0d4p1 100
> swift-ring-builder container.builder add z1-192.168.1.3:6002/c0d4p1 100
> swift-ring-builder object.builder add z1-192.168.1.3:6002/c0d4p1 100
> swift-ring-builder account.builder add z2-192.168.1.4:6002/c0d4p1 100
> swift-ring-builder container.builder add z2-192.168.1.4:6002/c0d4p1 100
> swift-ring-builder object.builder add z2-192.168.1.4:6002/c0d4p1 100
> swift-ring-builder account.builder add z3-192.168.1.5:6002/c0d4p1 100
> swift-ring-builder container.builder add z3-192.168.1.5:6002/c0d4p1 100
> swift-ring-builder object.builder add z3-192.168.1.5:6002/c0d4p1 100
> [snip]
>
> So right now, the problem is: the disk growth in each of the storage
> nodes seems to have stalled,
So you've added 3 new devices to each ring and assigned a weight of 100
to each one. What are the weights of the other devices in the ring? If
they're much larger than 100, the new devices will end up holding only a
small fraction of the data you want on them.
Running "swift-ring-builder <thing>.builder" with no subcommand will show
you information, including weights, for all the devices in the ring.
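If the existing devices do turn out to have much larger weights, one way to
fix it is to raise the new devices to match and rebalance. A rough sketch,
reusing the device names from your commands above; the target weight of 3000
is only an illustration, so substitute whatever your existing devices use,
and repeat for the account and container builders:

    swift-ring-builder object.builder                                    # list devices and their weights
    swift-ring-builder object.builder set_weight z1-192.168.1.3:6002/c0d4p1 3000
    swift-ring-builder object.builder set_weight z2-192.168.1.4:6002/c0d4p1 3000
    swift-ring-builder object.builder set_weight z3-192.168.1.5:6002/c0d4p1 3000
    swift-ring-builder object.builder rebalance

After rebalancing, copy the regenerated .ring.gz files out to every proxy
and storage node, which brings us to your bonus question.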
> * Bonus question: why do we copy ring.gz files to storage nodes and
> how critical they are. To me it's not clear how Swift can afford to
> wait (even though it's just a few seconds) for .ring.gz files to be
> in storage nodes after rebalancing - if those files are so critical.
The ring.gz files contain the mapping from Swift partitions to disks. As
you know, the proxy server uses it to determine which backends have the
data for a given request. The replicators also use the ring to determine
where data belongs so that they can ensure the right number of replicas,
etc.
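You can inspect that mapping yourself with swift-get-nodes, which prints the
partition an object hashes to and the IP:port/device holding each replica.
For example (assuming the ring lives in the usual /etc/swift location; the
account, container and object names are just placeholders):

    swift-get-nodes /etc/swift/object.ring.gz AUTH_myaccount mycontainer myobject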
When two storage nodes have different versions of a ring.gz file, you
can get replicator fights. They look like this:
- node1's (old) ring says that the partition for a replica of
/cof/fee/cup belongs on node2's /dev/sdf.
- node2's (new) ring says that the same partition belongs on node1's
/dev/sdd.
When the replicator on node1 runs, it will see that it has the partition
for /cof/fee/cup on its disk. It will then consult the ring, push that
partition's contents to node2, and then delete its local copy (since
node1's ring says that this data does not belong on node1).
When the replicator on node2 runs, it will do the converse: push to
node1, then delete its local copy.
If you leave the rings out of sync for a long time, then you'll end up
consuming disk and network IO ping-ponging a set of data around. If
they're out of sync for a few seconds, then it's not a big deal.
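So the usual practice is to rebalance in one place and then push the fresh
.ring.gz files to every node right away. A minimal sketch using your storage
node IPs (the /etc/swift path and root login are assumptions about your
setup):

    for node in 192.168.1.3 192.168.1.4 192.168.1.5; do
        scp /etc/swift/*.ring.gz root@$node:/etc/swift/
    done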