launchpad-dev team mailing list archive


Fwd: Results of Loggerhead Bug Investigation

To: Launchpad Development List <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>
Date: Thu, 19 Nov 2009 09:54:45 -0500
Organization: Canonical Ltd.
User-agent: KMail/1.12.3 (Linux/2.6.31-14-generic; KDE/4.3.3; x86_64; ; )

----------  Forwarded Message  ----------

Subject: Results of Loggerhead Bug Investigation
Date: November 19, 2009
From: "Max Kanat-Alexander" <mkanat@xxxxxxxxxxxxxxxxxxxx>
To: "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>

	Hey Francis.

	So, I investigated the memory leak and the codebrowse hanging problem.

	The memory leak is just some part of the code leaking a tiny amount of
memory when a specific type of page is requested (I'm not sure which
page yet). The tiny leak grows over days until the process is very
large.  I can reproduce the leak locally. The rest of the work involved
in this would be tracking down where the leak occurs and patching it--I
suspect this will not be a major architectural change, just a fix to
loggerhead or perhaps Paste. However, I think the task of initial
analysis is complete.
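
	As a rough illustration of how the leak could be narrowed down, one
option is to exercise a single page type repeatedly and compare
allocation snapshots. The sketch below assumes a modern Python with the
standard tracemalloc module; the top_growth helper and the exercise
callable are hypothetical names, not part of loggerhead or Paste.

```python
import tracemalloc


def top_growth(exercise, repeat=100, limit=5):
    """Return the source lines whose allocations changed the most.

    `exercise` is a hypothetical zero-argument callable that triggers
    one request of the page type under suspicion.
    """
    tracemalloc.start()
    exercise()  # warm up so one-time caches are not counted as a leak
    before = tracemalloc.take_snapshot()
    for _ in range(repeat):
        exercise()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Entries are sorted by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]
```

	A line that keeps growing as repeat is raised is a candidate for the
leak; a line that stays flat is probably just a cache filling up once.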

	The more significant issue is the hangs. The hang is, in a sense, two
separate issues:

	1) When a user loads multiple revisions of a very large branch
(launchpad itself, bzr itself, or mysql) that doesn't have a revision
graph yet, building the revision graph takes an enormous amount of time
and causes the rest of loggerhead to slow to a crawl, thus causing it to
appear hung for three to five minutes.

	2) Loggerhead (or perhaps just a single loggerhead instance) doesn't
scale very well across many branches with many users, partially because
of how the revision graph is currently built and partially (I suspect)
because any given Python process is going to be limited by the Global
Interpreter Lock in how many concurrent requests it can actually handle.

	So the question for this issue is--what level would you like me to
address it on? If you'd like me to simply work on the revision graph
issue, I could do that within the current architecture of loggerhead and
devise a fix. Probably the simplest would be to just place a mutex
around building a revision graph for any one branch. However, that may
not fix the actual *performance* problems seen with codebrowse; it might
just make hangs less likely. A more general approach to loggerhead's
scalability would result in a fix for this and also for any performance
issues that loggerhead sees in the Launchpad environment. A quick search
for "python paste scale" in Google turns up
http://pypi.python.org/pypi/Spawning/ which (after sufficient vetting)
might be a reasonable solution. Then once we have a better single-server
solution, making it scale out to multiple servers (by having a central
store for the revision graph cache and making sure that loggerhead plays
well under load-balancing) would be the next step.
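
	To make the mutex idea concrete, here is a minimal sketch of one lock
per branch around the graph build, so that concurrent requests for the
same branch wait for a single build instead of each starting their own.
The RevisionGraphCache class, the build_graph callable, and the
branch_id key are all hypothetical; this is not loggerhead's actual
code or API, and it keeps the cache in memory only.

```python
import threading


class RevisionGraphCache:
    """Hypothetical cache that builds each branch's graph at most once."""

    def __init__(self, build_graph):
        self._build_graph = build_graph  # expensive: branch_id -> graph
        self._graphs = {}                # finished graphs by branch_id
        self._locks = {}                 # one build lock per branch_id
        self._locks_guard = threading.Lock()

    def _lock_for(self, branch_id):
        # Guard the lock table itself so two threads cannot create
        # two different locks for the same branch.
        with self._locks_guard:
            return self._locks.setdefault(branch_id, threading.Lock())

    def get(self, branch_id):
        graph = self._graphs.get(branch_id)
        if graph is not None:
            return graph  # fast path: already built
        with self._lock_for(branch_id):
            # Re-check: another thread may have finished the build
            # while this one was waiting for the lock.
            graph = self._graphs.get(branch_id)
            if graph is None:
                graph = self._build_graph(branch_id)
                self._graphs[branch_id] = graph
            return graph
```

	Requests for different branches do not block each other here, but a
build still competes with every other thread for CPU time, which is why
this would reduce duplicate work without addressing the broader
performance problem.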

	Perhaps the best thing would be to come up with a "quick patch" to save
the LOSAs from having to constantly restart codebrowse, and then once we
have that situation at least mitigated, we could go on to work on the
actual underlying scalability issue.

	Does that sound good to you?

	-Max
-- 
Max Kanat-Alexander
Chief Engineer
http://www.everythingsolved.com/
Everything Solved: Complete Computer Management

-------------------------------------------------------
-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx

