
maas-devel team mailing list archive

Re: Scaling to 72k nodes

 


On 10/16/2012 6:24 PM, Martin Packman wrote:
> Thanks for the write up John!
> 

It looks like I made a mistake in my setup. I just logged on this
evening to see how postgres itself would handle processing xpath for
us, just as a point of reference. However, it turned out that I wasn't
properly setting the hardware details for all nodes (I was generating
them, but not actually calling set_hardware_details(new_xml_str)).
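
For reference, the relevant bit of the generation loop now looks
roughly like this (make_node() and make_lshw_xml() are stand-ins for
my local helpers, not MAAS APIs):

    # hypothetical sketch of the fixed generation loop; make_node() and
    # make_lshw_xml() stand in for my local script's helpers
    number_of_nodes = 72000
    for i in range(number_of_nodes):
        node = make_node()
        new_xml_str = make_lshw_xml(i)
        # this is the call I had been leaving out
        node.set_hardware_details(new_xml_str)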

Having done that, it appears that only 12,000 nodes actually had
hardware details, which explains why those clusters are slow and the
others are super fast. So, I'll come back with more info.

The initial result is that 'select xpath(hardware_details)' takes
about 64s on 72,000 nodes, and 'select
sum(char_length(hardware_details::text))' takes 13s. So just
extracting the raw text takes 13s, which should be a hard lower bound
on processing time. However, we can beat the 64s because of parallelism.
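
For reference, the timings above came from something like the
following sketch (the table/column names and the xpath expression are
assumptions on my part; Postgres' xpath() takes the expression first
and an xml value second):

    import time
    import psycopg2

    # connection string, table/column names and the xpath expression are
    # assumed; the point is just timing the two statements over 72k rows
    conn = psycopg2.connect("dbname=maas")
    cur = conn.cursor()

    queries = [
        ("xpath", """SELECT xpath('//node[@class="memory"]/size/text()',
                                  hardware_details::xml)
                     FROM maasserver_node"""),
        ("raw text", """SELECT sum(char_length(hardware_details::text))
                        FROM maasserver_node"""),
    ]

    for name, sql in queries:
        start = time.time()
        cur.execute(sql)
        cur.fetchall()
        print(name, time.time() - start)   # ~64s vs ~13s here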

With 18 clusters, I see:

1 tag		21s
2 tags		40s

If I cheat and use python to issue a processing request to just a
single cluster worker, it completes in 10s. So we are clearly already
saturated.

So instead, I changed the script to ask just N workers to rebuild (a
rough sketch of that change follows the table). This is what I got:
clusters   1 tag (s)   2 tags (s)
    1          9            9
    2         10           11
    3         10           12
    4         10           13
    5         11           14
    6         11           14
    7         12           16
    8         13           18
    9         13           19
   10         13           21
   11         14           23
   12         15           27
   13         14           27
   14         14           30
   15         19           33
   16         18           34
   17         19           37
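
The change to the driver script amounts to something like this
(rebuild_tag() is a stand-in for however the script asks a cluster
controller to re-evaluate a tag, not a real MAAS call):

    import time
    from threading import Thread

    def rebuild_on_worker(worker, tags):
        # stand-in: ask one cluster controller to re-evaluate each tag
        # against its nodes' hardware details, and wait for it to finish
        for tag in tags:
            worker.rebuild_tag(tag)

    def time_rebuild(workers, tags):
        threads = [Thread(target=rebuild_on_worker, args=(w, tags))
                   for w in workers]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    # each row of the table is time_rebuild(workers[:n], ['tag-1']) and
    # time_rebuild(workers[:n], ['tag-1', 'tag-2']) for n = 1..17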

The graph is attached. You can see pretty clearly that the time
depends mostly on the total number of tag updates being done (2 tags
running across 2 clusters takes about the same time as 1 tag across 4
clusters). This means that with this exact hardware, growing to 25
clusters and 1 tag should take ~25s to update.


72k nodes * 25kB is also 1.8GB of data being sent over the network for
one rebuild. (So 2 tags, 18 clusters, is >3.6 GB of data being
transferred.)
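
(Rough arithmetic, taking 1 GB as 10^9 bytes:)

    >>> 72000 * 25e3 / 1e9      # 72k nodes * ~25kB of XML each, in GB
    1.8
    >>> 2 * 72000 * 25e3 / 1e9  # two tags' worth
    3.6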

Interestingly, I don't see any obvious big change in speed based on
load. It slows down as load goes up, and the slowdown looks nonlinear
[e^0.04x fits at R^2=.99], but we aren't hitting a massive transition
point.
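
For reference, a fit like that can be reproduced with a sketch along
these lines (I'm assuming the 1-tag column from the table above is the
series being fitted):

    import numpy as np
    from scipy.optimize import curve_fit

    # 1-tag rebuild times from the table above (clusters vs seconds)
    clusters = np.arange(1, 18)
    seconds = np.array([9, 10, 10, 10, 11, 11, 12, 13, 13,
                        13, 14, 15, 14, 14, 19, 18, 19], dtype=float)

    def model(x, a, b):
        return a * np.exp(b * x)

    (a, b), _ = curve_fit(model, clusters, seconds, p0=(9.0, 0.05))
    residuals = seconds - model(clusters, a, b)
    r_squared = 1 - (residuals ** 2).sum() / ((seconds - seconds.mean()) ** 2).sum()
    print(a, b, r_squared)   # b lands in the neighbourhood of 0.04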

If we want to keep poking at this, the next obvious thing is to
separate the DB from the appserver, and then try scaling up the
appservers and watching the load on the DB.

John
=:->



> On 16/10/2012, John Arbash Meinel <john@xxxxxxxxxxxxxxxxx> wrote:
>> 
>> Earlier in the week I had 10 Cluster Controllers, each on a
>> c1.medium, which is 2-cores of the same speed as the c1.xlarge.
>> Today I added an additional 8 Cluster Controllers (because I'm
>> currently limited to 20 EC2 instances, and I need to figure out
>> how to change that).
> 
> You might want to poke James Page on IRC as he needed his limit
> bumped up in the past, unfortunately I think it's not very easy to
> navigate the Amazon bureaucracy.
> 
>> One interesting bit is the 'fairness' of the system. It appears
>> that all the cluster controllers get the job request at
>> approximately the same time, but I end up getting most of them
>> completing the job in about 3.5s, but then a second wave of them
>> take 13s to complete. My guess is that it has something to do
> with keep-alive. The Region controller currently has 12 'wsgi'
>> workers, (though I really only see 8 of them with active CPU at
>> any one time).
> 
> Was it clear at which stage the delay was on for the second batch? 
> Waiting for the hardware details, or when posting back the
> matching nodes? Fiddling with the number of workers does sound like
> it may affect things.
> 
> Martin
> 


Attachment: tagsvstime.png
Description: PNG image

