launchpad-dev team mailing list archive

Re: Performance tuesday: faster development

 

On Wed, 18 May 2011 15:02:45 +1200, Robert Collins <robertc@xxxxxxxxxxxxxxxxx> wrote:
> > I think an advantage that we'll gain from all this is greater fault
> > tolerance.  Because we'll have more services to fail, they will fail
> > more often, and if we don't get better at this we'll present a really
> > terrible user experience (compared to the way LP just doesn't work
> > without the librarian today).
> 
> :) That's certainly a component, but your math is flawed; the absolute
> number of appservers may not change - we may just shuffle. The failure
> rates for each service may be different. The backend services will all
> be haproxied or similar. So I don't think it's as simple as 'more
> components - more failures to deal with'. Possibly more /sorts of
> failures/ - that's something that I think needs attention; but not
> failures-per-day.

I think I mostly meant more kinds of failures.

> > I also think we'll need to develop a set of patterns for how we
> > deploy/organize our services.  Something like an easy way to deploy a
> > service behind haproxy, with conventions for how haproxy will probe the
> > service for health and a set of scripts for doing no-downtime service
> > upgrades (and rollbacks).  It almost makes me wonder whether we should
> > standardize on a vm based approach (like all the cool kids are doing)
> > and use things like ensemble to do our deployments (and even a chaos
> > monkey[0] to ensure reliability?).  Something about api versioning goes
> > in here too I guess.
> >
> > [0] http://www.readwriteweb.com/cloud/2010/12/chaos-monkey-how-netflix-uses.php
> 
> I think that that will come naturally; as far as vms vs puppet, I'm
> inclined to stick with the current toolchain IS are using - it's a
> variable we don't need to change.

Sure, puppet is a totally adequate solution here (this actually occurred
to me over lunch).  I guess the actual point is that we want operations
like "adding another server to the haproxy pool for service $foo" and
"setting up a service built of standard components (e.g. http served by
haproxy/python/postgres)" to require little time and even less thinking
from the sysadmins -- including setting up things like nagios and
logging.
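
One possible shape for the "how will haproxy probe the service" convention is for every service to answer a fixed health URL. The sketch below is only an illustration: the /health path and the health_app name are made up here, not something we have agreed on.

```python
def health_app(environ, start_response):
    """Tiny WSGI app answering a conventional health probe.

    The "/health" path is a hypothetical convention, chosen for illustration.
    """
    if environ.get("PATH_INFO") == "/health":
        # A real service would check its own dependencies before saying OK.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK\n"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]
```

Any WSGI server (wsgiref's simple_server, for instance) could serve this, and an HTTP health check in the load balancer could then be pointed at that path for every service in the pool.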

> > You say that XML-RPC might be a good default choice for a protocol,
> > which sounds basically sane to me, although the error reporting makes me
> > want to cry (another lesson from the codehosting experience --
> > exceptions raised by the createBranch model method get translated to
> > XML-RPC faults which then get translated to the exceptions the bzrlib
> > transport API uses).  We can come up with a better convention here like
> > putting the class name of the exception to raise as the faultString or
> > something, but then we're going beyond XML-RPC to the extent that you'll
> > be wanting to use some kind of specialized client rather than stock
> > xmlrpclib.  This may not be so bad, but it also means that the
> > underlying protocol isn't very important, it's the Python (or more
> > generally, RPC I guess) API that's important, and the protocol is a
> > detail.  I also don't know of a protocol that has what I would consider
> > good error handling built into it though.
> 
> I agree with all of this.
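
To make the faultString idea concrete, here is a rough sketch of the convention, assuming Python 3's xmlrpc.client (the successor to xmlrpclib). The exception class, the KNOWN_ERRORS registry and both helper functions are invented for illustration; nothing like this exists in the codebase today.

```python
import xmlrpc.client


class BranchCreationForbidden(Exception):
    """Hypothetical service-side error, used only for illustration."""


# Hypothetical registry of exception classes the client is willing to re-raise.
KNOWN_ERRORS = {cls.__name__: cls for cls in (BranchCreationForbidden,)}


def to_fault(exc, code=1):
    """Server side: put the exception class name and message in faultString."""
    return xmlrpc.client.Fault(code, "%s: %s" % (type(exc).__name__, exc))


def from_fault(fault):
    """Client side: map a Fault back to a known exception.

    Unknown names are returned unchanged as the original Fault.
    """
    name, _, message = fault.faultString.partition(": ")
    cls = KNOWN_ERRORS.get(name)
    return cls(message) if cls is not None else fault
```

The client-side half is exactly the "specialized client" mentioned above: stock XML-RPC clients would only ever see the Fault.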

Another point that occurred to me after lunch: for all sorts of reasons,
accessing a remote service should be explicit in our code (so let's not
use xmlrpclib.ServerProxy?).  I'm sure you know this in your bones by
now with all the lazy loading Storm pain :) Implementing everything in
Twisted and so having deferreds bubble around would achieve this, but is
probably massive overkill!
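
A much cheaper way to get the explicitness, sketched under assumptions: wrap the proxy so that every remote call goes through one visibly named method, rather than looking like an ordinary attribute access. RemoteService and call_remote are hypothetical names, and the example uses Python 3's xmlrpc.client rather than xmlrpclib.

```python
import xmlrpc.client


class RemoteService:
    """Hypothetical wrapper: every network call goes through call_remote()."""

    def __init__(self, url, proxy=None):
        # A proxy can be injected for tests; otherwise use the stock client.
        self._proxy = proxy if proxy is not None else xmlrpc.client.ServerProxy(url)

    def call_remote(self, method, *args):
        # The method name is a string, so call sites are easy to grep for
        # and cannot be mistaken for a local method call.
        return getattr(self._proxy, method)(*args)
```

A call site then reads service.call_remote("createBranch", ...), which is at least hard to mistake for something cheap.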

> > Semi-relatedly, when you insist that a service has a performant
> > fake for using in tests, do you envision this being an actual network
> > service, or would an in-memory fake suffice?  An in-memory fake will
> > likely perform better, but perhaps not that much.
> 
> I think a network fake would give us more flexibility for integrating
> with heterogeneous versions of components (both shallowly for things
> like django/pastedeploy vs zope and more deeply for things like
> rewriting a performance critical component in a fast language).

Yeah.

> > Having dismissed the choice of protocol as a detail, have you considered
> > gearman or celery as a default protocol choice?
> 
> Yes :)

Heh.

> I think this is something we'll drill into when we start looking at
> making some of the choices implied by the proposal.

Fair enough.

> > For more concrete comments on the document, there are two sentences that
> > I just plain don't understand:
> >
> > "But actually sitting on one schema implies one large bucket within
> > which changes can propagate : and we see this regularly when we deal
> > with optimisations and changes to the implementation of common layers
> > (such as the team participation graph-caching optimisation)."
> 
> What I mean here is that the knock on effects of changing the team
> membership cache are extensive: dozens or hundreds of queries have to
> be rewritten (and reprofiled).

Oh right.

> > "One way to avoid that when we don't know what we want to integrate
> > (yet) is to focus on layers which can be reused across Launchpad rather
> > than on (for instance) breaking out user-visible components. This
> > doesn't preclude vertical splits but it is a lot easier to reason about
> > and prepare for."
> 
> Here I mean that taking one of our domains like 'bugs' and splitting
> it out is harder than taking something like the team membership cache
> ('teamparticipation') and splitting that out.

Ah OK.

I've adjusted both explanations in the wiki.

> > In the section "Identified service opportunities" I think it would be
> > good to explain a bit more what the services described actually do.
> 
> We may want to move them to separate documents or even LEPs. Some of
> them are getting pretty concrete.

Well sure, but the first paragraph under "team participation / directory
service" doesn't appear connected to anything else in the document.
Perhaps just reversing the order of the paragraphs in this section would
be a good start :)

> > Finally, would it be possible to sketch in some detail how a particular
> > page might be produced in the new world?  I think the branch merge
> > proposal page might be interesting -- it would use a good few of the
> > proposed services.
> 
> Sure. Let me give it a rough go here, and if that's understandable we
> can move it to the wiki.
> 
> ---- this is a strawman. It need not have any bearing on reality ---
> 
> client -> apache(SSL) -> haproxy(load balancing) -> template server
> template server then uses backend apis to perform traversal [perhaps
> by breaking the url into path segments and doing lookups for a pillar,
> then related objects etc, or perhaps by passing the url as a whole to
> the backend].
> With a resulting object to render, a view is instantiated.
> The view __init__ does something like this:
> try:
>  calls.timeline = get_timeline(self.context)
>  calls.diffs = get_diffs(self.context)
>  calls.summary = get_votes_and_summary(self.context)
> finally:
>  calls.gather()
> self.timeline = calls.timeline.result
> self.diffs = calls.diffs.result
> self.summary = calls.summary.summary
> self.votes = calls.summary.votes
> 
> and the template does view/timeline view/diffs view/summary and
> view/votes to get at the various bits.
> 
> Each of those get_ things was a thread-dispatched sync backend request
> using $protocol and returning loosely structured data - perhaps mapped
> to model objects, or perhaps just basic types. Any mapping to model
> objects would be done in the template server.
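
For what it's worth, the dispatch-then-gather shape in the strawman can be approximated with a thread pool. This is only a sketch: the three get_ functions below are local stand-ins for backend requests, MergeProposalView is a made-up name, and concurrent.futures is my substitution for whatever the "calls" object would really be.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins; real versions would be sync requests over $protocol.
def get_timeline(context):
    return ["%s: proposed" % context]


def get_diffs(context):
    return ["%s: preview diff" % context]


def get_votes_and_summary(context):
    return {"summary": "%s summary" % context, "votes": ["approve"]}


class MergeProposalView:
    def __init__(self, context):
        self.context = context
        with ThreadPoolExecutor(max_workers=3) as pool:
            # Each submit() dispatches one backend request to its own thread.
            timeline = pool.submit(get_timeline, context)
            diffs = pool.submit(get_diffs, context)
            summary = pool.submit(get_votes_and_summary, context)
        # Leaving the with block waits for all requests: the "gather" step.
        self.timeline = timeline.result()
        self.diffs = diffs.result()
        self.summary = summary.result()["summary"]
        self.votes = summary.result()["votes"]
```

The total wait is then roughly that of the slowest request rather than the sum of all three.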

I think this all makes some sort of sense.  Two further thoughts spring
from this:

If there is a traversal service, it would map URLs to ... what?  Do
things that are model objects today all have some kind of unique name in
the new world (it could be as simple as bug-$id)?  I guess it could
return just a (view-name, model-object-name) pair.
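
As a sketch of what I mean, assuming regex-based routing: the URL shapes, view names and object-name templates below are all invented for illustration and don't correspond to any real traversal rules.

```python
import re

# Hypothetical routing table: (URL pattern, view name, object-name template).
ROUTES = [
    (re.compile(r"^/bugs/(?P<id>\d+)$"), "bug-index", "bug-{id}"),
    (
        re.compile(r"^/~(?P<owner>[^/]+)/(?P<project>[^/]+)/(?P<branch>[^/]+)"
                   r"/\+merge/(?P<id>\d+)$"),
        "merge-proposal-index",
        "merge-proposal-{id}",
    ),
]


def traverse(path):
    """Map a URL path to a (view-name, model-object-name) pair, or None."""
    for pattern, view_name, name_template in ROUTES:
        match = pattern.match(path)
        if match:
            return view_name, name_template.format(**match.groupdict())
    return None
```

The template server would then use the returned name as the key for all its subsequent backend requests.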

The other thought is that I'm not sure the concept of a model object is
useful in this new world!  Returning loosely structured data from the
service (roughly the set of data types JSON supports... although
probably one would want some others, such as dates) sounds about right
to me.  I guess one won't know until something gets implemented.
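
On the "JSON plus dates" point, one common trick is to tag the extra types on the way out and untag them on the way in. A minimal sketch, assuming a made-up "__datetime__" tag and helper names of my own invention:

```python
import json
from datetime import datetime, timezone


def encode_extra(obj):
    """json.dumps default hook: tag datetimes, reject everything else."""
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError("cannot serialise %r" % (obj,))


def decode_extra(d):
    """json.loads object hook: turn tagged dicts back into datetimes."""
    if set(d) == {"__datetime__"}:
        return datetime.fromisoformat(d["__datetime__"])
    return d


payload = {
    "id": 1,
    "date_created": datetime(2011, 5, 18, 3, 2, 45, tzinfo=timezone.utc),
}
wire = json.dumps(payload, default=encode_extra)
restored = json.loads(wire, object_hook=decode_extra)
```

Here restored compares equal to payload, so the service and the template server can exchange dates without either side needing model objects.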

Cheers,
mwh

