
maas-devel team mailing list archive

Re: Scaling to 72k nodes

 


On 10/17/2012 10:38 AM, Stuart Bishop wrote:
...

> A rule of thumb is to allow (#cores + #drives) x (1 + idle%)
> simultaneous requests. So for an 8-core system writing to 5 mirrored
> drives, where we find that 20% of the time the db connections are
> idle (waiting for client-side processing): (8+5)*1.2 ≈ 16
> simultaneous requests. It is interesting that you don't see a
> benefit beyond just #cores. Are you benchmarking a read-heavy
> operation? If you don't have to worry about the disk subsystem,
> with PG 9.1 you should scale up to 24 cores reasonably happily,
> and with PG 9.2 up to 64 cores.
> 
> PG 9.2 has now been released; it would be interesting to see
> results if you dropped that in. You won't be shipping with 9.2 yet,
> but there are changes in 9.2 that could significantly speed things
> up.
> 

This particular test is definitely read-heavy. For this test, the
database has:

 18 nodegroups, each with 4,000 nodes, for 72,000 nodes in total.
 Each node has a ~25kB XML string associated with it.

 2 tags.
 A (node_id, tag_id) table matching nodes to tags.
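
For concreteness, those tables map onto Django models along these lines
(field names and sizes here are illustrative, not the exact MAAS schema):

# Rough sketch of the tables involved, as Django models.
from django.db import models

class Tag(models.Model):
    name = models.CharField(max_length=256, unique=True)
    definition = models.TextField()  # the XPath expression to evaluate

class Node(models.Model):
    system_id = models.CharField(max_length=41, unique=True)
    hardware_details = models.TextField()  # the ~25kB XML blob (TOASTed by Postgres)
    # This M2M creates the (node_id, tag_id) join table; the reverse
    # accessor on Tag is tag.node_set, as used further down.
    tags = models.ManyToManyField(Tag, related_name='node_set')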

We will have one Postgres database and N WSGI appservers (single-threaded),
hopefully spread across multiple physical machines (I'm just now testing
adding one more appserver, rather than hosting the appserver on the same
machine as the database).

When the work starts, each nodegroup worker first requests the list of
system_ids for its own nodes from the appservers (4,000 system_ids at 41
bytes each ≈ 164kB per nodegroup, ×18 ≈ 3MB of data requested in total).
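
Roughly, that first step on each worker looks like the sketch below (the URL
and response format are placeholders for illustration, not the real MAAS API):

# Sketch: a nodegroup worker fetching the system_ids of its own nodes
# from an appserver.  URL, auth and response format are placeholders.
import json
import urllib2

APPSERVER = 'http://appserver.example.com'  # placeholder

def fetch_system_ids(nodegroup_uuid):
    url = '%s/api/nodegroups/%s/nodes/' % (APPSERVER, nodegroup_uuid)
    system_ids = json.loads(urllib2.urlopen(url).read())
    # 4,000 ids at ~41 bytes each is roughly 164kB per nodegroup, so about
    # 3MB of response data across all 18 nodegroups.
    return system_ids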

It then requests hardware details in batches of 1,000 at a time and does
some XPath processing on the XML. This is where the bulk of the data is
read (25kB × 72k nodes = 1.8GB). Postgres does a fair amount of work here
decompressing all of that data from its TOAST tables. Note that 1.8GB fits
easily into the machine's 7GB of RAM, probably with room to spare because
the XML compresses well (24kB => 4.2kB gzipped).

The raw XML gets serialized into a JSON response by the appservers and
deserialized by the nodegroup workers, which then do the XPath processing.
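
The worker-side pass is essentially the following sketch (assuming lxml for
the XPath evaluation; details here is the deserialized JSON mapping
system_id to XML string for one batch):

# Sketch of the worker-side XPath pass over one batch of hardware details.
from lxml import etree

def classify_batch(details, xpath_expr):
    matched, unmatched = [], []
    evaluate = etree.XPath(xpath_expr)
    for system_id, xml_text in details.items():
        doc = etree.fromstring(xml_text.encode('utf-8'))
        # A non-empty node-set (or true boolean result) counts as a match.
        if evaluate(doc):
            matched.append(system_id)
        else:
            unmatched.append(system_id)
    return matched, unmatched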

When finished, each node has either matched the XPath or not. The worker
builds a list of matching and a list of non-matching node system_ids, which
it posts back to the appservers, and the appservers update the (node_id,
tag_id) table. In Django this is tag.node_set.remove(*remove_ids);
tag.node_set.add(*add_ids). Hopefully that translates into reasonable SQL
INSERT and DELETE statements.
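
On the appserver side that update is roughly the following sketch (view
plumbing and request parsing omitted; Node is the model sketched above, and
the import path is a placeholder):

# Sketch of the appserver-side update for one tag from the worker's
# matched / unmatched system_id lists.
from django.db import transaction
from myapp.models import Node  # placeholder import path

@transaction.commit_on_success  # Django 1.4-era transaction API
def update_tag_membership(tag, add_ids, remove_ids):
    # Translates to DELETE and INSERT statements against the
    # (node_id, tag_id) join table.
    tag.node_set.remove(*Node.objects.filter(system_id__in=remove_ids))
    tag.node_set.add(*Node.objects.filter(system_id__in=add_ids))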

From what I can tell, most of the time is spent extracting the information
and shipping it around.

In Postgres, if I do 'sum(char_length(xml_data))' over every node it takes
13s, so presumably that is the fastest Postgres can extract the 1.8GB of
data on a single connection. However, it can probably do that work across
as many parallel connections as you like, so the fastest on this hardware
would be roughly 13s / 8 cores ≈ 1.6s.
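
That 13s figure can be reproduced with a plain query along these lines
(table name and connection string are placeholders):

# Force Postgres to detoast and read every XML blob without any XPath work.
import time
import psycopg2

conn = psycopg2.connect('dbname=maas')  # placeholder DSN
cur = conn.cursor()
start = time.time()
cur.execute("SELECT sum(char_length(xml_data)) FROM node")
total_chars, = cur.fetchone()
print "read %d chars of XML in %.1fs" % (total_chars, time.time() - start)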

If I do xpath(xml_data) instead it takes 64s, so one nodegroup's share of
the processing should take ~3.5s (64s / 18 nodegroups), and Postgres should
take ~0.7s (13s / 18) to extract that data, assuming no other load. Given
the actual times (20s), I don't think the load is bothering Postgres at
all; the bottlenecks are the appservers and network bandwidth.

But beyond 8 node groups (or however many cores you have), *something* is
going to be waiting for another request to finish. Most of that time is
spent outside the DB, though. Certainly we are a long way from saturation
(where the next set of requests comes in before the first one is finished).

John
=:->

