Timings from data center

 

Summary:

The data center machine is a hyperthreaded 4 core machine--8 effective
cores--with 12 GB RAM.

The tests seem to reinforce that having enough memory is essential,
which is a lesson we also learned on EC2.  Moreover, 2.4 GB/core seems
necessary for the tests (remember that our test runs also store all disk
"writes" to RAM in the ephemeral instances).  That said, once you get up
to 32 cores on EC2, that much RAM per core was less necessary: on EC2,
60 GB for 32 cores did not use any swap at all.  Perhaps the RAM
requirement has a constant or logarithmic component in addition to a
linear one.

When running tests with up to 5 cores in the data center (which was as
high as the RAM headroom would let us go quickly), we saw a roughly
constant overhead of just under 10 minutes plus about 280 minutes of
test time that divides across the cores.  That put five cores just
above the hour mark.  If we had enough RAM, the equation suggests that
8 cores might take us as low as 40 minutes; we observed 50 minutes on
the (differently specced) EC2 machine.  As it was, with apparently
insufficient RAM, the eight and six core runs also came in at about the
hour mark.
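
If you want to play with those rough numbers, here is a minimal Python
sketch of that back-of-the-envelope model (the 280 minute and 10 minute
figures are the approximate values quoted above, not new measurements):

    # Summary model: test time that divides across cores, plus a roughly
    # constant per-run overhead.  Both numbers are approximations from above.
    DIVISIBLE_MINUTES = 280.0  # test work that splits across the cores
    OVERHEAD_MINUTES = 10.0    # per-run overhead (observed as "under 10")

    def predicted_minutes(cores):
        return DIVISIBLE_MINUTES / cores + OVERHEAD_MINUTES

    for cores in (1, 2, 4, 5, 6, 8):
        print("%d cores: about %d minutes" % (cores, predicted_minutes(cores)))
    # 5 cores comes out just over an hour; 8 cores around 45 minutes
    # (closer to 40 if the real overhead is well under 10 minutes).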

Recall that a previous email I sent to the RT contained EC2 timings that
also seemed to support these conclusions.

As preparation for this email, I also ran two passing test runs on a
hyperthreaded 16 core EC2 machine, with an effective 32 cores and 60 GB
RAM.  As mentioned above, it did not go into swap (it didn't have swap).
This took 36.5 minutes.  For the last five minutes of the run, most of
the processes had already finished; if testr could spread the work
around better (which might require it to understand layers better),
perhaps we could get that closer to the half hour mark.

I'll leave this up to Francis, Robert, and IS now to determine what kind
of machines to get for our two slaves.

If you'd like lots more details as to what I did, read on.  Otherwise, I
think that's a decent summary.

Details:

The format is HOURS:MINUTES:SECONDS.MILLISECONDS.

These are records of individual runs, not averages.  If we want a bigger
sample size, just ask.  I haven't seen a lot of variance on the EC2
machines, and the successful runs I've had in the data center have been
within about 4 minutes of one another.

8 cores:
1:02:57.579
0:59:27.242

6 cores:
1:02:45.707
1:05:35.201

5 cores:
1:04:53.690

4 cores:
1:16:44.012
1:13:45.777

2 cores:
2:26:31.700

1 core:
4:44:10.200

I was a bit confused about the similarity between the 8, 6, and 5 core
times, since we saw a definite difference between these concurrency
levels in the tests on the 8-core EC2 machine.  I investigated a bit,
and I suspect that the primary problem is that the machine only has
12 GB.  We determined in our tests on EC2 that you need the machine to
have at least 2 GB per core/concurrent LP process, and perhaps more.  If
that's correct, then we would expect the test times to continue to scale
down linearly if this machine had at least 16 GB.  For reference, the 8
core EC2 machines we are using (m2.4xlarge) have 68.4 GB RAM, and run
these tests somewhere between 49 and 52 minutes, depending on the run.

Our hypothesis from EC2 was that the time spent on a test run is a
combination of a constant, representing layer setup time that must be
duplicated on every LXC container's test run, plus linearly divided
effort.  Put another way:

    [time spent on a test run] =
        [total time performing actual tests] / [number of cores]
        + [layer setup time, duplicated for each process]

If we ignore the 8 core timings above because of the memory issue, and
do the math comparing the 6 core and 1 core times, rounding the run
times to whole minutes, I get these results:

total time performing tests (not setting up layers) = about 265 minutes
(4:25)
layer setup time duplicated on each core = about 19 minutes

If this is true, then...

4 cores should be 85.25 minutes (1:25)
2 cores should be 151.5 minutes (2:32)

As you can see, that's within 10 minutes of the observed times--not a
very good match, to be honest.

If I use the timings for the 4 core runs instead of the 6 core runs, I
get about 279 minutes of work and 5 minutes of setup; this predicts the
2 core number within a minute or two (2:25 predicted, 2:26.5 actual).  5
cores would be predicted at 61 minutes, and we actually got about
65--pretty close.  6 and 8 cores start futzing out, again, in my
estimation because of the low RAM.
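
In case it is handy, here is a minimal Python sketch of that arithmetic:
solving the two-unknown model for total test time and per-process setup
time from a pair of observed runs.  The 284, 75, and 63 minute figures
are my roundings of the 1, 4, and 6 core times above; they reproduce the
~265/19 and ~279/5 splits described above.

    # Model: minutes(cores) = total_test_minutes / cores + setup_minutes.
    # Two observed (cores, minutes) points pin down the two unknowns.

    def fit(a, b):
        (cores_a, mins_a), (cores_b, mins_b) = a, b
        total = (mins_a - mins_b) / (1.0 / cores_a - 1.0 / cores_b)
        setup = mins_a - total / cores_a
        return total, setup

    # Rounded observations, in minutes.
    one_core, four_core, six_core = 284, 75, 63

    for pair in (((6, six_core), (1, one_core)),    # first calculation above
                 ((4, four_core), (1, one_core))):  # second calculation above
        total, setup = fit(*pair)
        print("fit %s -> total %.0f min, setup %.0f min" % (pair, total, setup))
        for cores in (2, 4, 5, 6, 8):
            print("  %d cores predicted: %.0f minutes"
                  % (cores, total / cores + setup))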

Because of all this, I'm inclined to say that we need a bit *more* than
2 GB per process for these tests--maybe about 2.4 GB each.  Since each
process also has all disk writes stored to memory as well, that seems
somewhat reasonable.
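
For what it's worth, here is a minimal sketch of where a figure like
2.4 GB per process could come from, assuming (my assumption, purely for
illustration) that the 12 GB of RAM is simply divided among the five
concurrent processes of the largest run that still scaled roughly as
predicted:

    # Illustration only: split the machine's 12 GB evenly across the 5
    # concurrent test processes of the largest well-behaved run.
    RAM_GB = 12.0
    PROCESSES = 5
    print("%.1f GB per process" % (RAM_GB / PROCESSES))  # 2.4 GB per process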

I should add that I tracked free -m during one of the test runs, and
swap was only used a bit, but perhaps only a bit is all it takes to
noticeably affect the speed.

Both Francis and Robert have asked about using one of the cc2.8xlarge
instances on EC2 to see how that affects the tests.  Since we are now at
the stage of trying to decide what machine to buy, I decided to pursue
that now as well.  For reference, these are 16 cores, hyperthreaded to
32 cores, with 60.5 GB RAM (not quite enough according to our
calculations, but close: too small either by 3.5 GB at 2 GB/process or
by 16.3 GB at 2.4 GB/process).  This was a finicky setup, requiring some
tweaks to wait long enough for all the LXC containers to spin up.
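
The RAM arithmetic behind those two deficits, as a quick Python sketch
(60.5 GB is the cc2.8xlarge figure; 2 GB and 2.4 GB per process are the
two estimates discussed above):

    # How far short of the per-process RAM estimates is 60.5 GB when
    # running 32 concurrent test processes?
    RAM_GB = 60.5
    PROCESSES = 32

    for gb_per_process in (2.0, 2.4):
        needed = gb_per_process * PROCESSES
        print("%.1f GB/process: need %.1f GB, short by %.1f GB"
              % (gb_per_process, needed, needed - RAM_GB))
    # 2.0 GB/process -> need 64.0 GB, short by  3.5 GB
    # 2.4 GB/process -> need 76.8 GB, short by 16.3 GB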

Once we had that, the first time was 36 minutes 27 seconds, and the
second was 35 minutes 12 seconds.  I noticed that the last five minutes
of both runs were spent with very few of the LXC containers still
running.  In fact, the first LXC container to be finished in the second
run was done more than 10 minutes before the last LXC container finished.

My rough guess is that runs on this 32 core instance could finish about
5 minutes faster if testr could spread the tests around better.  I
suspect that the issue is that testr does not know that layers are not
tests, and the timings for them throw its calculations off.

This concludes the report on the timings for the data center test
machine, and the 32 core EC2 machine.  For easy reference, I include
below the copy of the email I sent on March 22 to RT 50242 about the
timing information we gathered on EC2 eight core machines.

Thanks

Gary




---------------------------------------------------------
[Email from myself to RT 50242 on March 22]

I am reporting test results so far.

This is the summary.

On an EC2 m2.4xlarge, cores affect test run time linearly up to eight
cores, approximately following [time] = 4 hours/[number of cores] + 20
minutes.  This may be enough information for Robert and Francis to
advise IS on the kind of processor we want.  I expect we will want some
additional tests on the machine in the data center when it is ready to
discover its slope.

Details follow.

We have run tests on an EC2 machine.  The following are all on
m2.4xlarge ("High-Memory Quadruple Extra Large Instance") instances.
These are the specs, taken from http://aws.amazon.com/ec2/instance-types/:

   68.4 GB of memory
   26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
   1690 GB of instance storage
   64-bit platform
   I/O Performance: High

We used buildbot to run and time the tests, with a setup that can be
duplicated by following the steps described in the Juju buildbot master
README file for initializing a Launchpad test environment.  We hacked
/usr/lib/python2.7/dist-packages/testrepository/testcommand.py to report
different local concurrency levels and otherwise ran all tests
identically, by forcing a build.  Example:

     def local_concurrency(self):
         # Hard-code the number of concurrent test processes for this run.
         return 3

Tests were assigned to each process by testr using round-robin.  We ran
the eight core version 5 times, but all the others were only run once.
Times were obtained by looking at the buildbot master for the time
buildbot recorded for running "testr run --parallel," as found on pages
such as /builders/lucid_lp/builds/0/steps/shell_8.  Each test run had
fewer than five failures, although the failures varied across runs.
Values here are rounded to the nearest minute.

    1 core:  4:17
    2 cores: 2:23
    3 cores: 1:40
    4 cores: 1:21
    5 cores:
    6 cores: 0:59
    7 cores:
    8 cores: 0:51

These times roughly correspond to the following equation:
[time] = 4 hours/[number of cores] + about 20 minutes
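
As a quick sanity check, here is a small Python sketch comparing that
equation with the observed times in the table above (times converted to
minutes by me):

    # [time] = 4 hours / [number of cores] + about 20 minutes,
    # versus the observed runs.
    OBSERVED_MINUTES = {1: 257, 2: 143, 3: 100, 4: 81, 6: 59, 8: 51}

    for cores in sorted(OBSERVED_MINUTES):
        predicted = 240.0 / cores + 20
        print("%d cores: predicted about %.0f min, observed %d min"
              % (cores, predicted, OBSERVED_MINUTES[cores]))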

I do not plan to run 5 core and 7 core tests unless requested.

For interest, if you do not use the /dev/random hack I mentioned
previously, you get these sorts of results:

1 core without /dev/random hack: 4:50
8 cores without /dev/random hack: 3:47

To get a comparable idea of performance on the machine in the data
center, we probably should run tests with [max] cores, 1 core, and
[max/2] cores.  We can extrapolate a line from that and roughly verify
it, assuming that it is linear.

That said, I think we already have reasonable evidence that the parallel
tests do scale roughly linearly up to eight cores.  I believe that this
should inform Robert and/or Francis on what kinds of processors would
bring us the desired balance of improvement versus cost.

Thanks

Gary

