launchpad-dev team mailing list archive

Re: reminder: test changed queries on qastaging *especially* for large tables *and* positively id as existing bugs any timeouts

 

Hi,

I'm the one responsible for this blunder, and I apologize for it. (And
thanks to William for fixing this while I was sleeping.)

1) When you get an OOPS on staging, do a thorough analysis. That means
looking at _all_ the OOPSes you get, ensuring that each problem is a
known problem, and checking that nothing weird related to your changes
shows up. In my case, I only looked at one of the last OOPSes I got,
which showed no problem apart from the known recalculateBugHeat issue
(OOPS-1998QASTAGING104). But that was the OOPS from my 3rd attempt.
The first one, OOPS-1998QASTAGING102 (which I didn't investigate),
showed the problem with a cold cache: the new query took 9s there (but
was very fast, <63ms, on the second and third attempts).
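
To make the point concrete, here is a minimal sketch of judging a QA run
by its slowest attempt rather than its last one. The helper name and the
1000 ms budget are hypothetical, and the 62 ms figures are placeholders
standing in for "<63ms"; none of this is actual Launchpad code.

```python
# Hypothetical helper: the name and the default budget are illustrative only.
def worst_attempt(timings_ms, budget_ms=1000):
    """Return (attempt_number, duration_ms) for the slowest attempt if it
    exceeds the budget, otherwise None. Attempt numbers start at 1."""
    slowest = max(range(len(timings_ms)), key=lambda i: timings_ms[i])
    if timings_ms[slowest] > budget_ms:
        return slowest + 1, timings_ms[slowest]
    return None


# Cold-cache first attempt (9 s), then two hot attempts (placeholder 62 ms).
attempts = [9000, 62, 62]
print(worst_attempt(attempts))
```

Looking only at the last entry would have reported a healthy query; the
first, cold-cache attempt is the one that exposes the regression.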

2) When "tuning" queries, please leave comments in the code! There was
no comment here, and I naively thought that I should get rid of the
extra query that fetches the archive ids and use a join instead. A bad,
bad idea, as it turned out. A comment explaining this non-intuitive
query would have saved me from re-learning that already-learned
lesson :-)
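
As a rough illustration of the kind of comment I mean, here is a sketch
of the two-step shape (fetch the archive ids first, then substitute them
into the main query) with the reason written next to it. The table and
column names are made up for the example, and it uses an in-memory
SQLite database only so that it runs on its own; it shows the shape of
the code, not the PostgreSQL planner behaviour or the real Launchpad
query.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE archive (id INTEGER PRIMARY KEY, distribution TEXT)")
    conn.execute("CREATE TABLE publication (name TEXT, archive INTEGER)")
    conn.executemany(
        "INSERT INTO archive VALUES (?, ?)",
        [(1, "ubuntu"), (2, "ubuntu"), (3, "debian")],
    )
    conn.executemany(
        "INSERT INTO publication VALUES (?, ?)",
        [("bash", 1), ("apache2", 2), ("bash", 2), ("zsh", 3)],
    )
    return conn


def names_in_distribution(conn, distribution):
    # Step 1: fetch the (small) set of archive ids with a separate query.
    archive_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM archive WHERE distribution = ?", (distribution,)
        )
    ]
    if not archive_ids:
        return []
    # Step 2: substitute the ids into the main query.
    # NOTE: This is deliberately NOT a join against the archive table.
    # On the production data set the join form was far slower (see
    # bug 800485). Re-test on qastaging before "simplifying" this.
    placeholders = ", ".join("?" for _ in archive_ids)
    rows = conn.execute(
        "SELECT DISTINCT name FROM publication "
        f"WHERE archive IN ({placeholders}) ORDER BY name",
        archive_ids,
    )
    return [row[0] for row in rows]


conn = make_demo_db()
print(names_in_distribution(conn, "ubuntu"))
```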

On 11-06-21 10:22 PM, Robert Collins wrote:
> We are currently dealing with bug 800485 where validation of
> sourcepackagenames has gone from 80ms to 1800ms(hot) or minutes
> (cold).
> 
> This was caused when a patch changed a non-storm query to a storm
> query *and* added a single join table in (rather than the substituted
> archive ids).
> 
> Most of our queries are now tuned; postgresql consistently chooses bad
> plans on the 'obvious' way to write things for many of our very large,
> or very skewed data sets.
> 
> As a result, whenever you change a query on a big table - where big
> means > 20K rows - it's important to try and exercise it on qastaging.
> 
> If the thing you are testing times out, it's *vital* that the timeout
> be positively identified as a pre-existing condition before assuming
> qastaging is slow[1].
> 
> In this particular case, the patch was qa'd, but an existing timeout
> bug was assumed to be the cause of qa timeouts: we should have grabbed
> the oops and positively id'd the timeout as the existing bug - that
> would have told us about the regression and let us avoid the crisis.
> 
> 1) how slow is qastaging? It's not, not really. It has enough memory on
> the DB server to page into hot cache the working set for any one page
> in the system: you may need to try a lot of times to seed the cache,
> but *everything* *can* work on qastaging.
> 
> 
> -Rob
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp


-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx
