launchpad-dev team mailing list archive

Re: Fwd: Results of Loggerhead Bug Investigation

Francis J. Lacoste wrote:
> ----------  Forwarded Message  ----------
> 
> Subject: Results of Loggerhead Bug Investigation
> Date: November 19, 2009
> From: "Max Kanat-Alexander" <mkanat@xxxxxxxxxxxxxxxxxxxx>
> To: "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>
> 
> 	Hey Francis.
> 
> 	So, I investigated the memory leak and the codebrowse hanging problem.
> 
> 	The memory leak is just some part of the code leaking a tiny amount of
> memory when a specific type of page is requested (I'm not sure which
> page yet). The tiny leak grows over days until the process is very
> large.  I can reproduce the leak locally. The rest of the work involved
> in this would be tracking down where the leak occurs and patching it--I
> suspect this will not be a major architectural change, just a fix to
> loggerhead or perhaps Paste. However, I think the task of initial
> analysis is complete.

This sounds sane.

> 	The more significant issue is the hangs. The hang is, in a sense, two
> separate issues:
> 
> 	1) When a user loads multiple revisions of a very large branch
> (launchpad itself, bzr itself, or mysql) that doesn't have a revision
> graph yet, building the revision graph takes an enormous amount of time
> and causes the rest of loggerhead to slow to a crawl, thus causing it to
> appear hung for three to five minutes.

As suspected then, but it sounds worse than I'd guessed.

> 	2) Loggerhead (or perhaps just a single loggerhead instance) doesn't
> scale very well across many branches with many users, partially because
> of how the revision graph is currently built and partially (I suspect)
> because any given Python process is going to be limited by the Global
> Interpreter Lock on how many concurrent requests it can honestly handle.

Yeah.

> 	So the question for this issue is--what level would you like me to
> address it on? If you'd like me to simply work on the revision graph
> issue, I could do that within the current architecture of loggerhead and
> devise a fix. Probably the simplest would be to just place a mutex
> around building a revision graph for any one branch.

That's probably a good fix for loggerhead, but maybe not sufficient for
Launchpad.
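
(Just to make the mutex suggestion concrete -- a toy sketch, not
loggerhead's actual code; names like `get_revision_graph` are made up:)

```python
import threading
from collections import defaultdict

# One lock per branch.  Access to the defaultdict itself is guarded by
# _locks_lock so two threads asking about the same branch always get
# the same lock object.
_locks_lock = threading.Lock()
_branch_locks = defaultdict(threading.Lock)

def get_revision_graph(branch_id, build_graph):
    """Build the revision graph for one branch, serialized per branch.

    Only one thread per branch runs the expensive build; other requests
    for that branch block until it finishes instead of duplicating the
    work (and the memory) of building the same graph concurrently.
    """
    with _locks_lock:
        lock = _branch_locks[branch_id]
    with lock:
        return build_graph(branch_id)
```

Note this only stops concurrent duplicate builds -- it doesn't make the
build itself any faster, which is Max's point about it not fixing the
underlying performance problem.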

> However, that may
> not fix the actual *performance* problems seen with codebrowse, it just
> might make hangs less likely. A more general approach to loggerhead's
> scalability would result in a fix for this and also for any performance
> issues that loggerhead sees in the Launchpad environment. A quick search
> for "python paste scale" in Google turns up
> http://pypi.python.org/pypi/Spawning/ which (after sufficient vetting)
> might be a reasonable solution.

Another team at Canonical tried Spawning and had to give up and go back
to Paste.  So let's learn from their misfortune :)

> Then once we have a better single-server
> solution, making it scale out to multiple servers (by having a central
> store for the revision graph cache and making sure that loggerhead plays
> well under load-balancing) would be the next step.

As Rob pointed out in the bug report, if we can have the load balancer
always direct requests for the same branch to the same loggerhead
backend, we don't need to worry too much about the central store part.
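
(The affinity Rob describes could be as simple as a stable hash of the
branch path over the backend list -- sketch only, with invented backend
names; the real load balancer would do this in its own config:)

```python
import hashlib

# Hypothetical pool of loggerhead backends.
BACKENDS = ["codebrowse1:8080", "codebrowse2:8080", "codebrowse3:8080"]

def backend_for(branch_path):
    # A stable hash of the branch path, so every request for a given
    # branch lands on the same loggerhead instance and can reuse that
    # instance's in-memory revision graph cache.
    digest = hashlib.sha1(branch_path.encode("utf-8")).digest()
    return BACKENDS[digest[0] % len(BACKENDS)]
```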

Speaking more generally, the problem is the revision cache -- can we
make it go away, or at least handle it better?  I always forget why we
actually need it, so let's try to recap:

 1. Going from revid -> revno.  Loggerhead does this a lot.
 2. Going from revno -> revid.  Probably done ~once per page.
 3. In History.get_revids_from().  This gets into behaviour territory.
Basically it "mainline-izes" a bunch of revisions.  It can probably
touch quite a lot of the graph.
 4. get_merge_point_list().  I can't remember what this does :(
 5. get_short_revision_history_by_fileid().  Just uses it to get the set
of all revids in the branch.
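
(For anyone not staring at the code, items 1 and 2 amount to keeping a
pair of dicts in sync.  Toy sketch, ignoring dotted revnos for merged
revisions -- not loggerhead's actual data structure:)

```python
class RevnoCache:
    """Toy bidirectional revid <-> revno map (items 1 and 2 above)."""

    def __init__(self, mainline_revids):
        # Mainline revnos are 1-based, oldest first, as in bzr.
        self.revid_to_revno = {
            revid: revno
            for revno, revid in enumerate(mainline_revids, start=1)
        }
        self.revno_to_revid = {
            revno: revid for revid, revno in self.revid_to_revno.items()
        }
```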

Y'see, one of the problems with a central graph store is that graphs are
big, and any central store implies IPC which implies serialization, and
serializing and deserializing something as big as Launchpad's revision
graph cache is annoyingly slow.  So one idea would be to have this
central store not serve up entire graphs, but instead be able to answer
the questions above.  There would be many problems with this approach of
course -- for example, you probably don't want to make a cross-process
call for every revid -> revno translation loggerhead does, and gathering
all the revids you'd want to translate before you start rendering would
be painful.
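
(One way around the per-revid chatter would be a batch API on the
store: one round trip per page rather than one per revid.  Entirely
hypothetical interface, with a fake in-process store standing in for
the IPC channel:)

```python
class GraphStoreClient:
    """Hypothetical client for a central graph store that answers
    questions about the graph rather than shipping whole graphs."""

    def __init__(self, store):
        self._store = store  # stands in for the IPC channel

    def revnos_for(self, branch_id, revids):
        # One round trip for a whole page's worth of revids, instead
        # of one cross-process call per revid -> revno translation.
        return self._store.batch_lookup(branch_id, list(revids))


class FakeStore:
    """In-process stand-in for the store, counting round trips."""

    def __init__(self, mapping):
        self._mapping = mapping
        self.calls = 0

    def batch_lookup(self, branch_id, revids):
        self.calls += 1
        return {r: self._mapping[r] for r in revids}
```

The hard part, as noted above, is that the rendering code would have to
know all the revids it wants before it starts rendering.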

On the more serious end, it might be worth pushing the generation of the
cache into the store itself; it could then compute the caches in
subprocesses or whatever to maximize CPU utilization, and to maintain
the performance of the loggerhead process(es).

Another, probably more tractable, improvement would be the ability to
incrementally extend revision caches in the common case of revisions
merely being added to the branch.  If the graph store stored the graphs
as more than just a lump, we could probably also reuse parts of the
mainline's graph when building the graph for a derived branch.  I think
John Arbash Meinel might have some code for this...
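
(The append-only case really is cheap -- assuming a simple
revid -> parent-revids map as the graph representation, which is a
simplification of what bzr actually stores:)

```python
def extend_graph(cached_graph, new_parent_map):
    """Extend a cached revision graph with newly added revisions.

    Both arguments map revid -> tuple of parent revids.  When history
    has only grown, we can reuse the cached graph wholesale instead of
    rebuilding it from scratch.
    """
    graph = dict(cached_graph)  # leave the cached copy untouched
    graph.update(new_parent_map)
    return graph
```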

In the meantime, if someone can tease out what:

    self.simplify_merge_point_list(self.get_merge_point_list(revid))

actually does, I'm all ears.

> 	Perhaps the best thing would be to come up with a "quick patch" to save
> the LOSAs from having to constantly restart codebrowse, and then once we
> have that situation at least mitigated, we could go on to work on the
> actual underlying scalability issue.

I'm not sure what the "quick patch" would be -- the mutex around
revision graph cache building?

Cheers,
mwh
