launchpad-dev team mailing list archive

Thread
Date
Re: bug notifications database utilization

To: Ian Booth <ian.booth@xxxxxxxxxxxxx>, Danilo Šegan <danilo@xxxxxxxxxxxxx>
From: Gary Poster <gary.poster@xxxxxxxxxxxxx>
Date: Fri, 25 Mar 2011 22:04:56 -0400
Cc: launchpad-dev@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4D8C185E.5060403@canonical.com>
On Mar 25, 2011, at 12:21 AM, Ian Booth wrote:

> Hi folks
> 
> An update on the status of this. The system appears ok *for now*.

Thank you, Ian.  Also thanks to Danilo for his help today in looking at the symptoms and talking over possible solutions.

I'm regarding this as critical and no longer an incident, and I think that's where you left this.  Practically, that means I think I work on it until it is fixed, but we don't have to work on it over the weekend.

> A cowboy was applied to prod (the optimisation gary refers to below) but
> in reality it didn't make much difference to the sql being run. By far
> the latest number of sql statements were single selects to get
> subscribers' email details and subscription details. There were 100's of
> them. These should each be consolidated into a bulk select. Or something
> else done to address the issue. See bug
> https://bugs.launchpad.net/launchpad/+bug/742230

I analyzed the new bug and the logs more.  I used a trick to add some timestamps to our SQL logging: http://pastebin.ubuntu.com/585447/ .  It's a bit fussy, but gave a lot more visibility than nothing at all.  Running this on qastaging led to a red herring: sendmail was consuming 2/3 of the runtime on qastaging.  That's doubly odd because we don't even send anything out from qastaging to my knowledge.  However, returning to the production logs, this symptom does not seem to be the case: I don't believe sendmail is the culprit for the particular problem we are experiencing now, because correlating the logs with the code (and the timestamped version we saw on qastaging), we were most often sending emails quite fast--86 a second in one case Danilo looked at.

I got one run from the LOSAs of that SQL logging trick on production, but the bug notification queue was empty; the LOSAs did not get around to fulfilling my request for a second run.  Limited LOSA availability has made looking at this problem even harder than it might be otherwise.

I did some more work on optimizing the query problems we saw.  This branch reduces the pertinent queries in my analysis locally, and should make a particular difference for reducing queries for notifications based on team subscriptions.  That is, on a small scale of notifications, this code is 1/3 to 1/2 better; but I believe the characteristics of these changes will make a much bigger difference in production, hopefully in an order of magnitude.

lp:~gary/launchpad/bug741684

I'll try to get that cowboyed Monday to see if that addresses the query situation better.

We also should run representative unique SQL generated by this code with EXPLAIN ANALYZE, ideally in the production database, since qastaging and staging do not generate notifications in the quantity that production does.  Since I did not get a response for the next step of my previously mentioned request to the LOSAs, I never got to making this request of them.

> 
> The latest dbr report shows the bugnotification user using approx 10%
> cpu over the past few hours. Bug email is going out but the cpu is still
> far too high. Why is the system running "ok" now? I'm not sure and
> that's a worry.

As I saw some other people hypothesize, I'm guessing we're running just barely OK most of the time right now, but we can be flipped over into cascading errors whenever we tip over into extra capacity.

> 
> Another issue I noticed from the logs is as follows. The cronjob runs
> every 5 minutes. The logs over the past few days show that there are
> times when notifications are still being sent when the next script
> invocation occurs eg
> 
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Creating lockfile:
> /var/lock/launchpad-send-bug-notifications.lock
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> ...
> 2011-03-23 15:15:16 INFO    Notifying xxx about bug 733732.
> 
> Often also the notification lines in the log are several seconds or more
> apart, indicating the call to sendmail() blocks for a time. So I have 2
> questions:
> 
> 1. How is the new script invocation happening if the old one appears to
> still be running? My theory is that the new script starts and blocks
> until the old one finishes. And if the next one is slow too, then it all
> compounds....

That doesn't quite jibe with what I think we see here, but I could be wrong.  The core issue does appear to be that it seems to be possible for a script to run simultaneously, though we haven't caught that smoking gun yet.

I don't intend to address this at the moment, though I'll be interested in tackling it when we go to bug rotation if someone else has not already looked at it.

> 
> 2. Why do some calls to sendmail() take so long to complete. And given
> they do, what can be done about it.

This is a good question, but the production logs appear to show that the core problem at the moment is in LP code.

> 
> So that's all for now. There's no definitive fix that's been applied,
> but the logs have given perhaps a little more insight into where to
> start looking next.

Thanks again, Ian and Danilo.

Gary



> 
> 
> On 24/03/11 23:05, Gary Poster wrote:
>> Thanks for the heads up, Stuart.
>> 
>> Stuart clarified on IRC that this is currently a scalability problem rather than a performance problem, though that could change.
>> 
>> This started happening after the most recent DB deploy.  The most likely cause is my own work to address bug 164196, "Quickly-undone actions shouldn't send mail notifications" (code: http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/revision/12533#lib/lp/bugs/scripts/bugnotification.py).
>> 
>> There is an obvious optimization to try (get the activity record along with the notification that points to it). I expect that will reduce the database usage, but I would be very surprised if it would get us down to the previous number.  We need to look at more data than before in order to answer these questions.
>> 
>> I've created https://bugs.launchpad.net/launchpad/+bug/741684 for this optimization.
>> 
>> Gary
>> 
>> On Mar 24, 2011, at 4:25 AM, Stuart Bishop wrote:
>> 
>>> Hi.
>>> 
>>> The database utilization report has picked up that bug notifications
>>> is now chewing 17% of a master database CPU core. 2 months ago, it was
>>> using <1% of a master database CPU core.
>>> 
>>> This seems excessive, and could be a red flag to making things more
>>> complex with more advanced filters.
>>> 
>>> -- 
>>> Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
>>> http://www.stuartbishop.net/
>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~launchpad-dev
>> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~launchpad-dev
>> More help   : https://help.launchpad.net/ListHelp
>> 
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp
Follow ups

Re: bug notifications database utilization
From: Jeroen Vermeulen, 2011-03-26
References

bug notifications database utilization
From: Stuart Bishop, 2011-03-24
Re: bug notifications database utilization
From: Gary Poster, 2011-03-24
Re: bug notifications database utilization
From: Ian Booth, 2011-03-25