
launchpad-dev team mailing list archive

Re: Performance tuesday: faster development

 

On Wed, May 18, 2011 at 2:08 PM, Michael Hudson-Doyle
<michael.hudson@xxxxxxxxxxxxx> wrote:
> Hi all,
>
> This seems like a really exciting proposal.  It describes a world
> significantly more awesome than the current one, and inspires the
> following slightly incoherent thoughts.

Thanks for sharing them...

> Part of this proposal seems to be saying "by taking away something we're
> good at, we'll force ourselves to get better at something we're bad at"
> -- and while that's not without logic, it also makes me go "hmm".

I can see that. OTOH it's something we're good at because we're
accommodating the other issues in the system; if we can remove those,
we don't need to be good at it.

> I think an advantage that we'll gain from all this is greater fault
> tolerance.  Because we'll have more services to fail, they will fail
> more often, and if we don't get better at this we'll present a really
> terrible user experience (compared to the way LP just doesn't work
> without the librarian today).

:) That's certainly a component, but your math is flawed; the absolute
number of appservers may not change - we may just shuffle. The failure
rates for each service may be different. The backend services will all
be haproxied or similar. So I don't think it's as simple as 'more
components - more failures to deal with'. Possibly more /sorts of
failures/ - that's something I think needs attention; but not
failures-per-day.

> I also think we'll need to develop a set of patterns for how we
> deploy/organize our services.  Something like an easy way to deploy a
> service behind haproxy, with conventions for how haproxy will probe the
> service for health and a set of scripts for doing nodowntime service
> upgrades (and rollbacks).  It almost makes me wonder whether we should
> standardize on a vm based approach (like all the cool kids are doing)
> and use things like ensemble to do our deployments (and even a chaos
> monkey[0] to ensure reliability?).  Something about api versioning goes
> in here too I guess.
>
> [0] http://www.readwriteweb.com/cloud/2010/12/chaos-monkey-how-netflix-uses.php

I think that will come naturally; as for VMs vs puppet, I'm inclined
to stick with the current toolchain IS are using - it's a variable we
don't need to change.
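
FWIW the health-probe convention could be very lightweight: each
service exposes a trivial endpoint that haproxy's 'option httpchk'
hits. A sketch, with the /health path, port and wsgiref server all
invented for illustration rather than agreed anywhere:

# Strawman health endpoint; pair with 'option httpchk GET /health'
# in the backend's haproxy config. Path and port are invented.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    if environ['PATH_INFO'] == '/health':
        # A real service would also check its own backends here.
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return ['OK']
    start_response('404 Not Found', [('Content-Type', 'text/plain')])
    return ['not found']

if __name__ == '__main__':
    make_server('0.0.0.0', 8080, app).serve_forever()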

> We do have an example that's similar in many ways to what you describe
> in LP today: codehosting.  It uses XML-RPC rather than database access
> to communicate with the rest of Launchpad.  It does have some of the
> advantages you describe -- particularly the faster tests using a fake
> thing.  It does have some pain points too -- keeping the fake up to date
> is pretty boring, and it's not a complete fake (for example, it doesn't
> model how private branches are visible to subscribers -- which leads me
> on to remarking that the interface the service codehosting uses is not
> 'complete' in some sense -- it lacks many write operations on branches
> such as renaming.  If we are SOA everywhere, I guess this part will go
> away).  Another point about the codehosting example is performance -- to
> make the branch-distro.py script work with acceptable performance I had
> to create yet another implementation of the service that talked to the
> database directly, which is something of a shame.  That was back in the
> day when the internal xml-rpc server was a SPOF, maybe it would be
> better now.  But also maybe Zope isn't the right tech to build services
> with...

I think Zope is likely at fault here - that, and the internal xmlrpc
service was horrendously overloaded a year+ ago.

But it may also have needed a different API to match the script's needs.

> You say that XML-RPC might be a good default choice for a protocol,
> which sounds basically sane to me, although the error reporting makes me
> want to cry (another lesson from the codehosting experience --
> exceptions raised by the createBranch model method get translated to
> XML-RPC faults which then get translated to the exceptions the bzrlib
> transport API uses).  We can come up with a better convention here like
> putting the class name of the exception to raise as the faultString or
> something, but then we're going beyond XML-RPC to the extent that you'll
> be wanting to use some kind of specialized client rather than stock
> xmlrpclib.  This may not be so bad, but it also means that the
> underlying protocol isn't very important, it's the Python (or more
> generally, RPC I guess) API that's important, and the protocol is a
> detail.  I also don't know of a protocol that has what I would consider
> good error handling built into it though.

I agree with all of this.
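
To make that concrete, the specialised client could be as thin as
this (a sketch only: the 'ExceptionName: message' faultString format
and the exception registry are invented, not an agreed convention):

import xmlrpclib

class BranchNotFound(Exception):
    pass

class PermissionDenied(Exception):
    pass

# Hypothetical whitelist of exceptions a service may raise.
KNOWN_EXCEPTIONS = {
    'BranchNotFound': BranchNotFound,
    'PermissionDenied': PermissionDenied,
}

def call(method, *args):
    """Call an XML-RPC method, translating faults back to exceptions.

    Assumes the server encodes faultString as 'ExceptionName: message'.
    """
    try:
        return method(*args)
    except xmlrpclib.Fault, fault:
        name, sep, message = fault.faultString.partition(': ')
        exc_class = KNOWN_EXCEPTIONS.get(name)
        if sep and exc_class is not None:
            raise exc_class(message)
        raise  # Unknown fault: propagate untranslated.

A stock xmlrpclib.ServerProxy still does the wire work underneath;
the 'specialised client' is just this translation layer on top.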

> Semi-relatedly, when you insist that a service has a performant
> fake for using in tests, do you envision this being an actual network
> service, or would an in-memory fake suffice?  An in-memory fake will
> likely perform better, but perhaps not that much.

I think a network fake would give us more flexibility for integrating
with heterogeneous versions of components (both shallowly, for things
like django/pastedeploy vs zope, and more deeply, for things like
rewriting a performance-critical component in a fast language).
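
As a sketch of how small a network fake can be (the createBranch
shape and the port are invented; a test harness would run this in a
background thread):

from SimpleXMLRPCServer import SimpleXMLRPCServer

class FakeBranchService:
    """In-memory state behind a real network endpoint.

    Tests speak the production wire protocol to it, so client-side
    serialisation and error handling are exercised too - and the
    client under test can be zope, django, or anything else.
    """

    def __init__(self):
        self.branches = {}

    def createBranch(self, owner, name):
        branch_id = len(self.branches) + 1
        self.branches[branch_id] = (owner, name)
        return branch_id

server = SimpleXMLRPCServer(('localhost', 8123), logRequests=False)
server.register_instance(FakeBranchService())
server.serve_forever()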

> Having dismissed the choice of protocol as a detail, have you considered
> gearman or celery as a default protocol choice?

Yes :) I think this is something we'll drill into when we start
looking at making some of the choices implied by the proposal.

> For more concrete comments on the document, there are two sentences that
> I just plain don't understand:
>
> "But actually sitting on one schema implies one large bucket within
> which changes can propagate : and we see this regularly when we deal
> with optimisations and changes to the implementation of common layers
> (such as the team participation graph-caching optimisation)."

What I mean here is that the knock-on effects of changing the team
membership cache are extensive: dozens or hundreds of queries have to
be rewritten (and reprofiled).
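
For instance, any query shaped like the following (hypothetical SQL,
not quoted from the codebase) bakes in the cache's current
representation:

# Hypothetical example of the kind of query that depends on the
# TeamParticipation cache table: find bugs a person can see via
# team subscriptions. Change how that cache is stored and every
# query of this shape has to be rewritten and reprofiled.
BUGS_VIA_TEAMS = """
    SELECT Bug.id
    FROM Bug
    JOIN BugSubscription ON BugSubscription.bug = Bug.id
    JOIN TeamParticipation
        ON TeamParticipation.team = BugSubscription.person
    WHERE TeamParticipation.person = %s
"""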

> "One way to avoid that when we don't know what we want to integrate
> (yet) is to focus on layers which can be reused across Launchpad rather
> than on (for instance) breaking out user-visible components. This
> doesn't preclude vertical splits but it is a lot easier to reason about
> and prepare for."

Here I mean that taking one of our domains like 'bugs' and splitting
it out is harder than taking something like the team membership cache
('teamparticipation') and splitting that out.

> In the section "Identified service opportunities" I think it would be
> good to explain a bit more what the services described actually do.

We may want to move them to separate documents or even LEPs. Some of
them are getting pretty concrete.

> Finally, would it be possible to sketch in some detail how a particular
> page might be produced in the new world?  I think the branch merge
> proposal page might be interesting -- it would use a good few of the
> proposed services.

Sure. Let me give it a rough go here, and if that's understandable we
can move it to the wiki.

---- this is a strawman; it need not have any bearing on reality ----

client -> apache(SSL) -> haproxy(load balancing) -> template server
The template server then uses backend APIs to perform traversal
[perhaps by breaking the URL into path segments and doing lookups for
a pillar, then related objects etc., or perhaps by passing the URL as
a whole to the backend].
With a resulting object to render, a view is instantiated.
The view __init__ does something like this:
try:
    calls.timeline = get_timeline(self.context)
    calls.diffs = get_diffs(self.context)
    calls.summary = get_votes_and_summary(self.context)
finally:
    calls.gather()
self.timeline = calls.timeline.result
self.diffs = calls.diffs.result
self.summary = calls.summary.result.summary
self.votes = calls.summary.result.votes

and the template does view/timeline, view/diffs, view/summary and
view/votes to get at the various bits.

Each of those get_ calls is a thread-dispatched synchronous backend
request using $protocol, returning loosely structured data - perhaps
mapped to model objects, or perhaps just basic types. Any mapping to
model objects would be done in the template server.
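
Spelled out slightly more (still strawman: 'Call' below is a made-up
helper, not an existing API; a real one would also capture exceptions
raised on the worker thread and re-raise them when read):

import threading

class Call:
    """Dispatch one backend request on its own thread.

    gather() blocks until the request finishes; .result then holds
    whatever the backend returned.
    """

    def __init__(self, func, *args):
        self.result = None
        self._thread = threading.Thread(target=self._run, args=(func,) + args)
        self._thread.start()

    def _run(self, func, *args):
        self.result = func(*args)

    def gather(self):
        self._thread.join()

# Invented stand-ins for the RPC-backed lookups used by the view.
def get_timeline(context):
    return ['created', 'needs-review']

def get_diffs(context):
    return '--- old\n+++ new\n'

timeline = Call(get_timeline, 'mp-1')  # both requests are in flight...
diffs = Call(get_diffs, 'mp-1')
for call in (timeline, diffs):
    call.gather()                      # ...and the view blocks here once
print timeline.result
print diffs.result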

-Rob

