yellow team mailing list archive

Re: Timings from data center

To: Robert Collins <robert.collins@xxxxxxxxxxxxx>
From: Gary Poster <gary.poster@xxxxxxxxxxxxx>
Date: Fri, 18 May 2012 17:14:28 -0400
Cc: Launchpad Yellow Squad <yellow@xxxxxxxxxxxxxxxxxxx>, "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>, Liam Young <liam.young@xxxxxxxxxxxxx>
In-reply-to: <CAJ3HoZ2dZ-p+UusKE3iHyAEKuw+AeZq2J-yC61uuAi0rqn94xQ@mail.gmail.com>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 05/16/2012 03:59 PM, Robert Collins wrote:
> This is great news.
> 
> So roughly 10m + 19m + 265m/workers. Neato.
> 
> Testr will ignore the layers if you configure the new option we added;
> it will still show a layer that fails to startup. This may help with
> the idle time aspect.
> 
> What do you think of us getting say 16cores, hyperthreaded, *one*
> machine. That would make the action from idle snappier, at the cost of
> either contention when there are serial landings, or complexity in
> buildbot to say 'only run one test at a time, devel || db-devel'.

The serial approach is trivial.  I've gotten it working and tested it,
and it's fine.  It turns out that BuildSlaves have a "max_builds"
keyword argument available at instantiation.  It defaults to None
(infinite) and we can easily set it to 1.  It would be less than five
minutes of work in the data center.  We'd probably want to expose this
in the juju charm as well, for our own tests, but that would be very
easy.  I've hacked on it just a bit already.
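
For reference, a minimal sketch of what that change might look like in
a Buildbot 0.8-style master.cfg.  Only the "max_builds" keyword comes
from the paragraph above; the import path assumes Buildbot 0.8.x, and
the helper function, slave name, and password are hypothetical
placeholders rather than our real configuration.

```python
def serial_slaves(name="example-slave", password="changeme"):
    """Return a slave list whose single slave runs one build at a time.

    The name and password defaults are placeholders for illustration.
    """
    # Assumption: Buildbot 0.8.x, where BuildSlave lives in
    # buildbot.buildslave.  The import is deferred so this sketch can be
    # loaded on a machine that does not have Buildbot installed.
    from buildbot.buildslave import BuildSlave

    # max_builds defaults to None (no limit).  Setting it to 1 means
    # builds assigned to this slave run one after the other.
    return [BuildSlave(name, password, max_builds=1)]


# In master.cfg this would be wired in roughly as:
#     c['slaves'] = serial_slaves()
```

With a single slave shared by the devel and db-devel builders, this is
what would make their test runs serial.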

In contrast, the concurrent/contention approach is going to be at least
somewhat expensive to deal with. I tried a trivial experiment with an
immediate failure, and realized that we would have to deal with two
separate underlying LXC containers for the ephemerals, because we build
and update in the real container before switching to the ephemerals for
the tests.  That's certainly fixable, but will require time and work.
Beyond that, I expect more challenges.  If we go for the fully parallel
case, I would bet money that we will have additional problems because of
CPU contention.

If you'd like further exploration of the concurrent approach, please
ask, but I would personally be much happier to stick with the serial
approach.  Francis has directed us to proceed with the serial approach
for now.

From a developer's perspective, the serial approach has interesting
tradeoffs in comparison to a two machine approach.  On the one hand,
having to wait for a landing on devel would be somewhat more frequent
with the serial approach, because a landing to devel will always be
immediately followed by a landing to db-devel, which will block.  If
you had two machines, there would be no blockage.

On the other hand, developers might get their changes pushed to db-devel
slightly faster, because the 16 core machine should be about 10 minutes
faster per run than an 8 core machine would.

For developers, I'd argue the balance is in favor of the two-machine
approach.  However, if other planning factors mean that a single machine
wins, the developer story is still way better than now.

> 
> Relatedly, ec2land - do you think the HVM instance would be cost
> effective for ec2land/ec2 test? It sounds like it has great results @
> 36 minutes, but perhaps just an 8 core is sufficiently good @ 51
> minutes?

It's US $2.40 for the big machine versus US $1.80 for the eight core
(versus 5 hours * US $.64/hour = US $3.20 today, I think).  Both of
those are cheaper than what we pay today, and I don't think an extra
$.60 will break the bank for an additional 15 minutes of speed.
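
The arithmetic behind those numbers can be checked with a short
sketch.  The per-run prices and the 36 and 51 minute run times are the
figures quoted in this thread; the variable and function names are
only illustrative, and the "today" figure carries the same "I think"
caveat as above.

```python
# All money is in US cents so the arithmetic stays exact.
COST_BIG_CENTS = 240          # big (HVM) machine, per run
COST_EIGHT_CORE_CENTS = 180   # eight core machine, per run
TODAY_HOURS = 5               # length of a run today
TODAY_RATE_CENTS = 64         # hourly price today

MINUTES_BIG = 36              # run time quoted for the HVM instance
MINUTES_EIGHT_CORE = 51       # run time quoted for the eight core


def compare():
    """Return the per-run cost and time differences between the options."""
    today = TODAY_HOURS * TODAY_RATE_CENTS
    return {
        "today_cents": today,
        "extra_cents_for_big": COST_BIG_CENTS - COST_EIGHT_CORE_CENTS,
        "minutes_saved_by_big": MINUTES_EIGHT_CORE - MINUTES_BIG,
        "saving_vs_today_big_cents": today - COST_BIG_CENTS,
        "saving_vs_today_eight_core_cents": today - COST_EIGHT_CORE_CENTS,
    }


if __name__ == "__main__":
    for key, value in compare().items():
        print(f"{key}: {value}")
```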

OTOH, I suspect that getting the tests to run reliably on the 32 core
machine won't be cost effective, in terms of developer time.  I know
that there are at least a couple of new bugs lurking there.

> Ideally we'd have a super fast environment totally spun up and devs
> could reuse it individually, reducing setup cost and really tuning
> things; I think that is something for the next pass - definitely out
> of scope for this project. (It probably needs to be in canonistack, it
> probably needs N-machine scaling at that point, and other non-trivial
> additional work.)

Diminishing returns will hit at some point, of course.  Sounds
interesting though.

Gary
