
launchpad-dev team mailing list archive

XMLRPC server reconfiguration : zope assertion errors on bzr push

 

We had a period - about 36 hours long - where the internal xmlrpc
server was simply saturated and overloaded. During this period folks
would see the traceback described in
https://bugs.launchpad.net/launchpad/+bug/674416.

The master bug for this is
https://bugs.launchpad.net/launchpad-code/+bug/674305.

I'm writing to let everyone know the current status in case either:
 - the issue is not fixed
 - the fix triggers a different failure

Francis and I had discussed this earlier and thought that disabling
the importds (which are still disabled AFAIK) would mitigate the
problem, but pretty much immediately after he signed off we got more
reports.

This was a pretty high-visibility problem - while it was occurring it
would happen reliably push after push; it's scary and, unlike our
timeouts, unclear in its implications.

So, I escalated it within ISO - got Charlie and James at dinner, and
James popped into IRC when they had finished.

What we did was execute RT 41465 - the highest-priority RT for LP at
the time - because it was filed with the explicit goal of spreading
the load from our XMLRPC services over more workers.

We considered other options:
 - just leave it (not good: it was widespread, pervasive and very
worrying for users - is their data safe? Yes, but how do they know?)
 - put a loop in the codehosting server (see the sketch after this
list) - a high-risk code change with unknown knock-on effects
 - disable more services (we couldn't: mailing lists, codehosting and
the apache rewrites for codebrowse were all that remained driving
traffic to the service).
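
For the curious, I read "a loop" above as a retry around the failing
internal XML-RPC call. A minimal sketch of that (rejected, never
deployed) option might have looked roughly like this - the endpoint
URL and method name are purely illustrative placeholders:

    # Hypothetical sketch of the rejected retry option - never deployed.
    # The URL and method name are illustrative placeholders only.
    import time
    import xmlrpc.client

    def call_with_retries(url, method, *args, attempts=3, delay=2.0):
        """Retry an internal XML-RPC call a few times before giving up."""
        proxy = xmlrpc.client.ServerProxy(url)
        last_error = None
        for attempt in range(attempts):
            try:
                return getattr(proxy, method)(*args)
            except (xmlrpc.client.Fault, OSError) as error:
                last_error = error
                time.sleep(delay * (attempt + 1))  # simple linear backoff
        raise last_error

The risk is exactly the knock-on effect mentioned above: retries
multiply the load on a server that is already saturated.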

Having weighed those options, James Troup spent a number of hours
this evening reconfiguring the internal xmlrpc server: it's now served
from the main lpnet cluster (basically the same configuration that
e.g. qastaging and staging use), which gives these APIs roughly ten
times the resources - 52 worker threads rather than 4 (though shared
amongst many more users too).

To revert this (should things go pear-shaped):
 - DNS needs to be changed back
 - /etc/hosts on the codehosting machines needs to be changed back similarly.
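
A quick way to check which way a codehosting machine is currently
pointed is simply to resolve the internal hostname from it - a trivial
sketch, and the hostname below is an assumption for illustration
rather than the confirmed production name:

    # Minimal check of where the internal XML-RPC hostname resolves,
    # e.g. before/after flipping DNS or /etc/hosts back. The hostname
    # is an illustrative assumption, not the confirmed production name.
    import socket

    def resolve(hostname):
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror as error:
            return "unresolvable: %s" % error

    print(resolve("xmlrpc-private.launchpad.net"))

Since gethostbyname goes through the normal resolver, this picks up
both the DNS change and any /etc/hosts override.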

If we're right about the driving issue here, we should see far fewer
timeouts for internal XMLRPC operations from tomorrow. As I write
this, https://lpstats.canonical.com/graphs/CodehostingPerformance/20101107/20101114/
shows that a period of high-latency ssh connection spikes (which are
driven by xmlrpc responsiveness - it's how we get the ssh keys) may be
over.
https://lpstats.canonical.com/graphs/CodehostingPerformance/20101113/20101114/
gives a closer view - you can clearly see that single requests were
spiking up to > 10 seconds.
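
If anyone wants to watch this independently of the graphs, a crude
latency probe against the internal endpoint is enough to see whether
we are still anywhere near those >10 second spikes. This is a sketch
only - the URL and method name are placeholders, not the real values
the codehosting front end uses:

    # Crude latency probe for an internal XML-RPC endpoint - sketch only.
    # The URL and method name are placeholders, not production values.
    import time
    import xmlrpc.client

    def time_call(url, method, *args):
        """Return (elapsed seconds, result) for a single XML-RPC call."""
        proxy = xmlrpc.client.ServerProxy(url)
        start = time.monotonic()
        result = getattr(proxy, method)(*args)
        return time.monotonic() - start, result

    # Example usage against a placeholder host and method:
    elapsed, _ = time_call("http://xmlrpc.internal.example:8087/", "ping")
    print("internal XML-RPC round trip: %.2fs" % elapsed)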

Similarly, the current day's OOPSes are at:
Time Out Counts by Page ID

Hard    Soft    Page ID
384     578     CodehostingApplication:CodehostingAPI
94      832     CodeImportSchedulerApplication:CodeImportSchedulerAPI
89      14      Person:+commentedbugs
14      39      BugTask:+index
10      16      Archive:EntryResource:getBuildSummariesForSourceIds
4       0       https://api.edge.launchpad.net
4       0       ProjectGroup:+milestones
2       37      Distribution:+bugtarget-portlet-bugfilters-stats
2       7       DistroSeriesLanguage:+index
2       4       Distribution:+archivemirrors
(from https://devpad.canonical.com/~lpqateam/lpnet-oops.html#time-outs)

Again, if over-saturation was causing the issue, we should expect the
first two rows to stay constant (or nearly so) over the remainder of
the day. It's been stable for a good 30+ minutes now - in fact
Person:+commentedbugs just passed the CodeImportSchedulerAPI, with the
xmlrpc rows staying constant. This is a good sign.
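
A throwaway way to keep watching those two rows is to pull the summary
page and grep for the two page IDs. This assumes the page is directly
fetchable and that the IDs appear literally in the markup - both
assumptions, not verified:

    # Throwaway monitor for the two XML-RPC timeout rows on the OOPS
    # summary page. Assumes the page is directly fetchable and that the
    # page IDs appear literally in the HTML - neither is verified here.
    from urllib.request import urlopen

    URL = "https://devpad.canonical.com/~lpqateam/lpnet-oops.html"
    WATCHED = (
        "CodehostingApplication:CodehostingAPI",
        "CodeImportSchedulerApplication:CodeImportSchedulerAPI",
    )

    html = urlopen(URL).read().decode("utf-8", errors="replace")
    for line in html.splitlines():
        if any(page_id in line for page_id in WATCHED):
            print(line.strip())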

We have some follow-up work we should do:
 - reprovision the old xmlrpc server as a regular appserver
 - do the single-threaded appserver experiment
 - delete the now-unused OOPS prefix for the retired appserver
instance and delete its production config.

-Rob


