
maas-devel team mailing list archive

Re: Scaling to 72k nodes

 


On 10/18/2012 6:00 PM, Francis J. Lacoste wrote:
> On 12-10-17 05:02 AM, John Arbash Meinel wrote:
>> 18 nodegroups, each nodegroup has 4,000 nodes for 72,000 total
>> nodes. Each node has a ~25kB XML string associated with it.
> 
> That one cluster controller can host 4k nodes is an assumption we
> haven't empirically validated.
> 
> We should probably either validate it, or try variations on the
> nodes-to-clusters ratio (say 500 nodes per cluster, 1000, or 2000),
> just to see how the scaling behaves differently.
> 
> Cheers

For the purposes of what these clusters are doing, having more cluster
controllers with fewer nodes each can really only be faster, not
slower.

I'm willing to test that out a bit.

In the end, I did get HAProxy (c1.medium = 2 CPUs) running in front of
2 Region Controllers (c1.xlarge = 8 CPUs) backed by 1 DB host
(c1.xlarge). After getting that working, I ran iftop and noticed that
the bottleneck is not region controller CPU, but network bandwidth.
Specifically, we saturate 600+ Mbps (most of a 1Gbit link) while these
tags are updating. That means we might be able to scale up if the
cluster workers round-robin their requests across the region
controllers themselves (the bandwidth is switched, so in theory you
can get 1Gbps to each server), but if we hide the region controllers
behind HAProxy, then HAProxy's bandwidth becomes the bottleneck.
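
To illustrate what I mean by the workers doing the round-robin
themselves, a minimal sketch (the endpoint URLs and the fetch helper
are made up for illustration, not actual MAAS code):

    import itertools
    import urllib2

    # Hypothetical region controller endpoints; in a real deployment
    # these would come from the cluster's configuration.
    REGION_CONTROLLERS = [
        "http://region1.example.com/MAAS/api/1.0/",
        "http://region2.example.com/MAAS/api/1.0/",
    ]

    # Cycle through the controllers so successive requests spread the
    # load across separate 1Gbit links instead of funnelling everything
    # through one HAProxy frontend.
    _regions = itertools.cycle(REGION_CONTROLLERS)

    def fetch(path):
        base = next(_regions)
        return urllib2.urlopen(urllib2.Request(base + path)).read()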


I do have a patch for that which just landed: it enables Django's gzip
middleware and patches the cluster controllers to send an
'Accept-Encoding: gzip' header. This should also help web
responsiveness for any large pages, so it seemed like a generally good
thing.
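
The gist of the change looks roughly like this (a sketch, not the
actual branch; the URL is invented):

    # Region controller side (settings.py): GZipMiddleware compresses
    # any response whose client advertised gzip support.  It needs to
    # sit near the top of the middleware list so it wraps the response
    # last.
    MIDDLEWARE_CLASSES = (
        'django.middleware.gzip.GZipMiddleware',
        # ... existing middleware entries ...
    )

    # Cluster controller side: advertise gzip support on the request.
    import urllib2
    request = urllib2.Request("http://region.example.com/MAAS/api/1.0/nodes/")
    request.add_header('Accept-Encoding', 'gzip')
    response = urllib2.urlopen(request)
    # urllib2 does not decompress for us, so the caller has to gunzip
    # response.read() itself when Content-Encoding: gzip comes back.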

I hacked some of the nodes to test it, and on nodes that had
compression enabled the time dropped from 22s down to 15s. I'm waiting
for it to be packaged so I don't have to manually hack ~20 instances.


I can certainly come back and play with the specific layouts more,
though I'll need to submit a request to Amazon for more instances.
All indications are that the negative scaling would come from having
fewer clusters with more nodes each (say 10,000 nodes per cluster).

At least for tags, the actual processing is "embarrassingly parallel",
so if you want to throw more workers into the mix, it will happily
accept them. If we felt it was super critical, we could find a
granularity smaller than a whole cluster and have the processing work
on that (we already process in 1,000-node batches; we would just need
a way to divvy that work up differently).
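
As a sketch of what dividing the work at a finer granularity could
look like (the worker objects and submit() call are hypothetical; only
the 1,000-node batch size is real):

    BATCH_SIZE = 1000

    def batches(system_ids, size=BATCH_SIZE):
        """Yield successive slices of at most `size` node system ids."""
        for start in range(0, len(system_ids), size):
            yield system_ids[start:start + size]

    def process_cluster(system_ids, workers):
        # Hand the 1,000-node batches out round-robin to however many
        # workers we have, rather than tying one worker to a whole
        # cluster.
        for index, batch in enumerate(batches(system_ids)):
            workers[index % len(workers)].submit(batch)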

I'll poke at it one more time, but being able to say that rebuilding
tags from 1 Region Controller to 20 Cluster Controllers is
network-bandwidth bound, rather than CPU bound or DB load bound, is a
pretty good place to be.

John
=:->

