launchpad-dev team mailing list archive

Re: Performance tuesday: faster development

 

Hi all,

This seems like a really exciting proposal.  It describes a world
significantly more awesome than the current one, and inspires the
following slightly incoherent thoughts.


Part of this proposal seems to be saying "by taking away something we're
good at, we'll force ourselves to get better at something we're bad at"
-- and while that's not without logic, it also makes me go "hmm".


I think an advantage that we'll gain from all this is greater fault
tolerance.  Because we'll have more services to fail, they will fail
more often, and if we don't get better at this we'll present a really
terrible user experience (compared to the way LP just doesn't work
without the librarian today).


I also think we'll need to develop a set of patterns for how we
deploy/organize our services.  Something like an easy way to deploy a
service behind haproxy, with conventions for how haproxy will probe the
service for health and a set of scripts for doing no-downtime service
upgrades (and rollbacks).  It almost makes me wonder whether we should
standardize on a vm based approach (like all the cool kids are doing)
and use things like ensemble to do our deployments (and even a chaos
monkey[0] to ensure reliability?).  Something about api versioning goes
in here too I guess.

[0] http://www.readwriteweb.com/cloud/2010/12/chaos-monkey-how-netflix-uses.php
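
To make the health-probe convention a bit more concrete, here is a
minimal sketch of what each service could expose.  Everything in it is
an assumption for illustration: the "/health" path, the
HealthCheckedApp wrapper and the "draining" flag are made-up names,
not anything LP has today.  The idea is that haproxy's HTTP check
(option httpchk) polls the path, and an upgrade script flips the flag
so the instance is taken out of rotation before it is restarted.

```python
class HealthCheckedApp:
    """Wrap a WSGI app and answer health probes on a conventional path.

    Hypothetical convention: 200 means "send me traffic", 503 means
    "I am draining, stop sending me new requests".
    """

    def __init__(self, app, health_path="/health"):
        self.app = app
        self.health_path = health_path
        self.draining = False  # set to True just before an upgrade

    def __call__(self, environ, start_response):
        # Anything that is not the probe goes to the real service.
        if environ.get("PATH_INFO") != self.health_path:
            return self.app(environ, start_response)
        if self.draining:
            status, body = "503 Service Unavailable", b"draining\n"
        else:
            status, body = "200 OK", b"ok\n"
        start_response(status, [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]


def hello_app(environ, start_response):
    """Stand-in for a real service."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def probe(app, path):
    """Call a WSGI app in-process; return (status line, body)."""
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return seen[0], body
```

A rollback would be the same dance in reverse: drain the new instance,
bring the old one back, and let the probe report it healthy again.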


We do have an example that's similar in many ways to what you describe
in LP today: codehosting.  It uses XML-RPC rather than database access
to communicate with the rest of Launchpad.  It does have some of the
advantages you describe -- particularly the faster tests using a fake
thing.  It does have some pain points too -- keeping the fake up to date
is pretty boring, and it's not a complete fake (for example, it doesn't
model how private branches are visible to subscribers -- which leads me
on to remarking that the interface the service codehosting uses is not
'complete' in some sense -- it lacks many write operations on branches
such as renaming.  If we are SOA everywhere, I guess this part will go
away).  Another point about the codehosting example is performance -- to
make the branch-distro.py script work with acceptable performance I had
to create yet another implementation of the service that talked to the
database directly, which is something of a shame.  That was back in the
day when the internal XML-RPC server was a SPOF; maybe it would be
better now.  But also maybe Zope isn't the right tech to build services
with...


You say that XML-RPC might be a good default choice for a protocol,
which sounds basically sane to me, although the error reporting makes me
want to cry (another lesson from the codehosting experience --
exceptions raised by the createBranch model method get translated to
XML-RPC faults which then get translated to the exceptions the bzrlib
transport API uses).  We can come up with a better convention here like
putting the class name of the exception to raise as the faultString or
something, but then we're going beyond XML-RPC to the extent that you'll
be wanting to use some kind of specialized client rather than stock
xmlrpclib.  This may not be so bad, but it also means that the
underlying protocol isn't very important, it's the Python (or more
generally, RPC I guess) API that's important, and the protocol is a
detail.  I also don't know of a protocol that has what I would consider
good error handling built into it though.
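
A rough sketch of the "class name in the faultString" convention, to
show how little it takes and why it needs a non-stock client.  It uses
Python 3's xmlrpc.client (the successor to xmlrpclib); the exception
classes, the fault code and the "ClassName: message" format are all
assumptions made up for the example, not what codehosting does.

```python
import xmlrpc.client


class BranchCreationForbidden(Exception):
    """Hypothetical domain error, standing in for a model exception."""


class BranchNotFound(Exception):
    """Another hypothetical domain error."""


# The client only re-raises exception types it knows by name.
KNOWN_ERRORS = {
    cls.__name__: cls for cls in (BranchCreationForbidden, BranchNotFound)
}

# Arbitrary fault code chosen for this sketch.
APPLICATION_ERROR = 1


def to_fault(exc):
    """Server side: encode an exception as 'ClassName: message'."""
    fault_string = "%s: %s" % (type(exc).__name__, exc)
    return xmlrpc.client.Fault(APPLICATION_ERROR, fault_string)


def from_fault(fault):
    """Client side: turn a Fault back into a known exception.

    Unknown class names are returned as the original Fault so nothing
    is silently swallowed.
    """
    name, _, message = fault.faultString.partition(": ")
    cls = KNOWN_ERRORS.get(name)
    if cls is None:
        return fault
    return cls(message)


def roundtrip(exc):
    """Push an exception through the XML-RPC wire format and back."""
    wire = xmlrpc.client.dumps(to_fault(exc))
    try:
        xmlrpc.client.loads(wire)
    except xmlrpc.client.Fault as fault:
        return from_fault(fault)
    raise AssertionError("expected a fault response")
```

The from_fault half is exactly the "specialized client" bit: stock
xmlrpclib would hand callers a bare Fault and leave the parsing to
them.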

Semi-relatedly, when you insist that a service has a performant fake
for use in tests, do you envision this being an actual network service,
or would an in-memory fake suffice?  An in-memory fake will likely
perform better, but perhaps not by much.
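
For the sake of argument, this is the sort of in-memory fake I have in
mind.  The class and method names are hypothetical and only loosely
modelled on the codehosting interface; the point is that the code under
test takes the service as an argument and cannot tell the fake from a
real client.

```python
class InMemoryBranchService:
    """Hypothetical fake: same method names a real client would expose."""

    def __init__(self):
        self._branches = {}

    def createBranch(self, owner, name):
        path = "~%s/%s" % (owner, name)
        if path in self._branches:
            raise ValueError("branch already exists: %s" % path)
        self._branches[path] = {"owner": owner, "name": name}
        return path

    def hasBranch(self, path):
        return path in self._branches


def ensure_branch(service, owner, name):
    """Code under test: create the branch only if it is missing."""
    path = "~%s/%s" % (owner, name)
    if not service.hasBranch(path):
        service.createBranch(owner, name)
    return path
```

If an actual network fake turned out to be needed, the same object
could presumably be served with something like
xmlrpc.server.SimpleXMLRPCServer.register_instance, at the cost of a
socket round trip per call.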

Having dismissed the choice of protocol as a detail, have you considered
gearman or celery as a default protocol choice?


For more concrete comments on the document, there are two sentences that
I just plain don't understand:

"But actually sitting on one schema implies one large bucket within
which changes can propagate : and we see this regularly when we deal
with optimisations and changes to the implementation of common layers
(such as the team participation graph-caching optimisation)."

"One way to avoid that when we don't know what we want to integrate
(yet) is to focus on layers which can be reused across Launchpad rather
than on (for instance) breaking out user-visible components. This
doesn't preclude vertical splits but it is a lot easier to reason about
and prepare for."

In the section "Identified service opportunities" I think it would be
good to explain a bit more what the services described actually do.

Finally, would it be possible to sketch in some detail how a particular
page might be produced in the new world?  I think the branch merge
proposal page might be interesting -- it would use a good few of the
proposed services.

Cheers and apologies for any incoherence,
mwh


On Tue, 17 May 2011 15:01:55 +1200, Robert Collins <robertc@xxxxxxxxxxxxxxxxx> wrote:
> Nearly a year ago now when I started working on Launchpad (again :P)
> we faced a huge performance problem. We're over half way there now:
> our request backstop is set to 9 seconds (with 3 overrides). This is
> down from 20 seconds. We have approximately the same number of
> requests failing a day - well under a tenth of a percent across the
> site.
> 
> This is pretty damn awesome!
> 
> We've found and corrected a huge number of inefficient pages which
> simply did too many queries, and others which had mistakes in their
> SQL queries. We've also improved a number of query schemas. Needless
> to say doing all this work has involved changes in some of our
> toolchain (such as allowing model level caching).
> 
> And a month or so back when writing up the changes to our
> infrastructure that we did to address the infrastructure issues
> driving some aspects of our poor performance, I noted that we've crossed
> a significant perceptual threshold: we're no longer primarily
> perceived as slow.
> 
> This gives us the breathing room to look at the next major performance
> issue: our development cycle. A few things feed into this:
>  - It's getting harder to fix performance bugs simply: accessing 60K
> rows of cold data @ 2ms each is always going to be a 2 minute
> operation. We need more sophisticated solutions to handle the scale of
> some of our problems. Adding such solutions is tricky and often
> requires multiple iterations, but we can only iterate once a month due
> to downtime constraints.
>  - We have a code base where we routinely make changes with unexpected
> side effects, which hampers development. Sometimes they escape and
> become regressions (we added about a week of work in this way over the
> last 5 months).
>  - Running enough tests to be confident that the whole test suite will
> pass is really quite hard.
>  - Making reusable components is very tricky because of the tight
> coupling between our domain model and object persistence
> 
> Many of these things have been discussed before. I have a proposal
> which I would like your joint help critically assessing. It is by *no
> means* a done deal nor finalised.
> 
> The proposal is the first of three documents I intend us to have on
> this (large) topic:
>  * The analysis / overview / business case
>  * A vision that strips that analysis to its bare bones, establishes a
> framework for answering questions like 'should X be a service' and
> makes considered but opinionated choices about technology.
>  * A migration roadmap which identifies ordering, costs and benefits
> from the various things that go into a multigeneration massive
> migration.
> 
> In this proposal I have deliberately not made choices (such as 'rabbit
> vs xmlrpc vs restful json vs ...') which do not affect the overall
> discussion. I'm positive we'll have a fine old time deciding on
> different implementation choices; we should decide on the overall
> approach before making such choices though :) [what should we do, when
> should we do it and how should we do it... in that order when possible
> ]
> 
> I've spoken to some of you already about this - thank you -very- much
> for your feedback on the proposal so far. I owe you all! The list at
> the top of the document is probably not complete - some of the ideas
> have been around (literally) for years.
> 
> With no further ado:
> 
> https://dev.launchpad.net/ArchitectureGuide/ServicesAnalysis
> 
> Please read this and do one of:
>  - comment in it
>  - reply to this thread
>  - reply to me privately
> 
> depending on your personal preferences.
> 
> If the proposal survives this feedback process then I'll start digging
> into the juicy stuff - the other two documents I mention above.
> 
> -Rob

