
launchpad-dev team mailing list archive

Re: Is it ok to report slightly inflated bug counts in portlets; counting bugs; accuracy vs speed


On Wed, Apr 13, 2011 at 11:15 PM, John Arbash Meinel
<john@xxxxxxxxxxxxxxxxx> wrote:
>> I have put together a prototype at the SQL level; it performs
>> adequately: we can satisfy queries for ubuntu scale tags (queries are
>> at the end of this mail for those interested - or read the gory
>> details in the bug):
>> top 10 tags (logged in) completes in < 0.5 seconds and considers 33K rows
>> open bugs (logged in) completes in < 0.3 seconds and considers 14K rows
>> top 10 tags (anonymous) completes in < 0.2 seconds and considers 10K rows
>> open bugs (with privacy) completes in < 0.2 seconds and considers 110 rows
>
> I'm a little curious about 'blowout'. Because you have one row per
> tag*status*milestone. 'status' is fortunately bounded, but milestone can
> grow considerably. (In case of bzr, we currently have 150 milestones on
> LP, though only a few open ones.)
>
> 'tag' also has the tendency to accrue many items. I think you mentioned
> that Launchpad has 151 *official* tags, and I don't know how many more
> unofficial tags you have.
>
> So for a bzr/launchpad target, you could have 150*151*10 = 226,500 rows.
> I'm not 100% sure how many actual bugs we have, but it is on the order
> of <5,000 bugs (including fixed, open, and invalid bugs).
> ...
>
> I'm sure you could optimize and not include the full cross product. Any
> rows that sum to 0 would not be included, etc. If I read your comments
> correctly, the full expansion on staging was only 2.6M rows? (and 1.3M
> were for ubuntu itself?)
>
> I do realize your queries are nicely efficient at getting a subset. But
> turning a 900k BugTask table + 700k Bug table into a 2.6M row table
> doesn't seem to be scaling in the right direction.

It's not that bad :) - and the selectivity is great.

lpmain_staging=> select id from product where name='bzr';
  id
------
 1186
(1 row)

Time: 257.583 ms

lpmain_staging=> select count(*) from bugsummary where product=1186;
 count
-------
  1170
(1 row)

lpmain_staging=> select count(*) from bugtask where product=1186;
 count
-------
  5833
(1 row)

Time: 13.447 ms

lpmain_staging=> select count(*) from bugsummary;
 count
--------
 371308
(1 row)

So it's 1/5th the size for bzr, and 1/3rd the size for everything - but
it's one table rather than three, so it's effectively closer to 1/10th
for bzr and 1/6th for everything.
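
To make that concrete, here is roughly the shape of rollup the fact
table answers - a sketch only, with assumed column names (tag, count)
rather than necessarily the final schema:

 -- Hypothetical sketch: top 10 tags for bzr from the fact table.
 -- The column names here are illustrative, not the real schema.
 SELECT tag, SUM(count) AS bugs
   FROM bugsummary
  WHERE product = 1186        -- bzr, per the lookup above
    AND tag IS NOT NULL
  GROUP BY tag
  ORDER BY bugs DESC
  LIMIT 10;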

> I know that bzr has driven New and Critical to 0 quite often. If privacy
> is the only inaccurate measure, you could always split the numbering up.
> So you would see New 10 (~2 private). You could even include the 2 in
> the count, and then put ~2 private in a mouse over tooltip.
>
> Certainly I found it annoying when closing the last New bug and I
> returned to the page to have it say "1 New" and clicking on the New link
> gave me a search result with 0 entries in it. (I don't know that I
> bothered to file a bug, maybe.)

That was the memcache caching; if we maintain this page in realtime it
won't be needed.

> Below a certain threshold, I imagine the actual queries for counts are
> actually pretty fast. I don't know the queries, but if you have only 10
> New bugs, it shouldn't be very slow to compute. For counts above some
> threshold (where the DB wants to do a table scan), you could use
> aggregate numbers and just truncate them. So instead of seeing 81,962
> New, you would see 82k New. (That makes it hard to truncate at, eg 100
> bugs, though.)

We have some code to estimate counts; the problem is tying it all in
:) - in some ways this aggregate is better than estimation because,
unlike an estimate, it will never be far off (estimates can sometimes
be -very- wrong).
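
For reference, the usual cheap estimate in PostgreSQL just reads the
planner's statistics instead of scanning - this is a sketch of the
general technique, not our actual estimation code:

 -- Fast but stale: reltuples is only refreshed by VACUUM/ANALYZE,
 -- which is exactly why estimates can be -very- wrong.
 SELECT reltuples::bigint AS estimated_rows
   FROM pg_class
  WHERE relname = 'bugtask';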

> Anyway, my big concern is that it turns into a BranchRevision table.
> Which AFAIK also has good selectivity but its sheer size makes it very
> cumbersome to deal with. Also, as far as accuracy, if you only update it

We have queries on it that bring back 200,000 rows to do set
differences; that's 10 times the data for the fact table - and it's
being joined on awkwardly. Making those do a graph traversal in the
query (WITH RECURSIVE) should get them processing a few thousand rows
most of the time - which will be a lot faster.
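
Something like this shape - a sketch only, with illustrative table and
column names rather than the real schema:

 -- Hypothetical sketch: walk the revision graph from a branch tip
 -- with a recursive CTE rather than set-differencing two huge
 -- BranchRevision extracts. revisionparent(revision, parent) is an
 -- assumed layout for illustration.
 WITH RECURSIVE ancestry(revision) AS (
     SELECT 12345                  -- tip revision id
   UNION
     SELECT rp.parent
       FROM revisionparent rp
       JOIN ancestry a ON rp.revision = a.revision
 )
 SELECT revision FROM ancestry;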

> in Batch mode, then it is going to have the same accuracy problem that
> memcache did. And you'll close the last bug without the satisfaction of
> seeing "0 New". (There is a real feedback loop of 'I accomplished
> something' that I think you want to preserve.)

Sure, but we have that problem now with memcache. What we have now is:
 - stale figures (cached at all)
 - figures cached wrongly (you can end up seeing the figures Martin
   generated a few minutes ago)

I think we can easily move to
 - cached (for a few minutes) and inflated-on-private-bugs only

with a second pass to get to
 - live but inflated-on-private-bugs only

And thus the question - should we? :)

-Rob


