← Back to team overview

launchpad-dev team mailing list archive

Is it ok to report slightly inflated bug counts in portlets; counting bugs; accuracy vs speed

 

https://bugs.launchpad.net/launchpad/+bug/758587 describes a challenge
we have; counting bugs is expensive.

portlets like those on
https://bugs.staging.launchpad.net/ubuntu/+bugs?search=Search&field.status=In+Progress
and https://bugs.staging.launchpad.net/ubuntu show various aggregates
of bugs. They've been inaccurate (counting multiple tasks towards the
count) in the past; I recently made them more accurate as part of
making them faster; wgrant has just this week optimised the tags
portlet.

We're now about as fast as we can get while we do table scans of this
data in our current architecture (*). Scanning the 700K bug and 900K
bugtask tables together accounts for ~4 seconds of DB time once you
factor in privacy [though there is some promise of improving on that,
it won't be radically better].

A fact table (en.wikipedia.org/wiki/Fact_table) can answer a number of
aggregrates much more quickly, but at a cost. The cost is twofold: we
have to maintain the table, and its losing some precision, so
somethings become more vague and best answered from the source data.

I have put together a prototype at the SQL level; it performs
adequately: we can satisfy queries for ubuntu scale tags (queries are
at the end of this mail for those interested - or read the gory
details in the bug):
top 10 tags (logged in) completes in < 0.5 seconds and considers 33K rows
open bugs (logged in) completes in < 0.3 seconds and considers 14K rows
top 10 tags (anonymous) completes in < 0.2 seconds and considers 10K rows
open bugs (with privacy) completes in < 0.2 seconds and considers 110 rows

But what about the costs? And thats where we come to why I'm writing this mail.

Maintaining the table is easy - we can start with a background job
every few minutes, it takes 2 minutes to update globally on staging,
I'm sure on prod it would be even snappier. We can also update it with
triggers or from our appserver code.

The *results* of queries on this table will be inaccurate on private
bugs though, if a user has 2 or more paths to visibility.
Consider bug X, which I am directly subscribed to, and which
~launchpad is also assigned to. The schema I've come up with for this
table will have the bug X counted twice:
 - once with me as a viewer
 - once with ~launchpad as a viewer

but because we've discarded the bug ids that contributed to those rows
in the summary table, we cannot tell that we shouldn't add the counts
together, so we see 1 too many bugs when querying the table.

We have a choice:
 - make the data utterly precise
 - be a bit simpler and accept some inaccuracy

I'd like to go with simpler an inaccurate for now because:
 - we went for years with inaccurate counts unnoticed by users [ who
can tell that 700 should be 580, really?]
 - the error rate is pretty low - my tests so far suggest ~5% [of
private bugs] - so the numbers won't be *far* off.
 - being more precise will be more complex - we'll need to query
privacy separately

So, specifically to jml & Deryck, but also to the whole team and any
users following this discussion: is it ok to be slightly inflated but
fast?

-Rob

(*) A horizontally sharded system could in principle calculate from
raw data concurrently on 10 or 20 nodes and get the same numbers in
1/10th or 1/20th the time. Such system usually recommend having
summary indices though which make that unnecessary.

sample queries:
tags (with privacy)
 explain analyze select tag, sum(count) from bugsummary left join
teamparticipation on viewedby=teamparticipation.team where
distribution=1 and (viewedby is NULL or teamparticipation.person=2)
and sourcepackagename is NULL and tag is not null  group by tag
order by sum(count) desc
limit 10;

         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=68724.35..68724.38 rows=10 width=14) (actual
time=430.363..430.367 rows=10 loops=1)
   ->  Sort  (cost=68724.35..68728.58 rows=1692 width=14) (actual
time=430.363..430.364 rows=10 loops=1)
         Sort Key: (sum(bugsummary.count))
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=68666.64..68687.79 rows=1692
width=14) (actual time=427.296..429.017 rows=4460 loops=1)
               ->  Nested Loop Left Join  (cost=0.00..66672.08
rows=398912 width=14) (actual time=106.304..414.371 rows=10341
loops=1)
                     Filter: ((bugsummary.viewedby IS NULL) OR
(teamparticipation.person = 2))
                     ->  Seq Scan on bugsummary  (cost=0.00..7440.35
rows=55700 width=18) (actual time=106.230..135.308 rows=33483 loops=1)
                           Filter: ((sourcepackagename IS NULL) AND
(tag IS NOT NULL) AND (distribution = 1))
                     ->  Index Scan using teamparticipation_team_key
on teamparticipation  (cost=0.00..1.05 rows=1 width=8) (actual
time=0.003..0.006 rows=5 loops=33483)
                           Index Cond: (bugsummary.viewedby =
teamparticipation.team)
 Total runtime: 430.495 ms
(12 rows)

tags (anonymous user)
explain analyze select tag, sum(count) from bugsummary where
distribution=1 and (viewedby is NULL) and sourcepackagename is NULL
and tag is not null  group by tag
order by sum(count) desc
limit 10;
                                                              QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7626.05..7626.07 rows=10 width=14) (actual
time=191.697..191.704 rows=10 loops=1)
   ->  Sort  (cost=7626.05..7628.39 rows=935 width=14) (actual
time=191.695..191.699 rows=10 loops=1)
         Sort Key: (sum(count))
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  HashAggregate  (cost=7594.16..7605.84 rows=935 width=14)
(actual time=185.884..189.067 rows=4446 loops=1)
               ->  Seq Scan on bugsummary  (cost=0.00..7440.35
rows=30761 width=14) (actual time=142.195..168.853 rows=10006 loops=1)
                     Filter: ((viewedby IS NULL) AND
(sourcepackagename IS NULL) AND (tag IS NOT NULL) AND (distribution =
1))
 Total runtime: 191.804 ms
(8 rows)


open bugs (with privacy)
 explain analyze select sum(count), status from bugsummary
left join teamparticipation on
bugsummary.viewedby=teamparticipation.team where
distribution=1 and (viewedby is NULL or teamparticipation.person=2)
and tag is NULL and status in (10, 15, 20, 21, 22, 25)
and sourcepackagename is NULL
group by status
order by status;

     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=50096.87..50096.88 rows=1 width=8) (actual
time=210.216..210.217 rows=6 loops=1)
   Sort Key: bugsummary.status
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=50096.85..50096.86 rows=1 width=8) (actual
time=210.195..210.198 rows=6 loops=1)
         ->  Nested Loop Left Join  (cost=2778.16..49589.40
rows=101490 width=8) (actual time=95.659..209.967 rows=133 loops=1)
               Filter: ((bugsummary.viewedby IS NULL) OR
(teamparticipation.person = 2))
               ->  Bitmap Heap Scan on bugsummary
(cost=2778.16..8831.38 rows=14171 width=12) (actual
time=95.608..118.717 rows=6803 loops=1)
                     Recheck Cond: (tag IS NULL)
                     Filter: ((sourcepackagename IS NULL) AND
(distribution = 1) AND (status = ANY
('{10,15,20,21,22,25}'::integer[])))
                     ->  Bitmap Index Scan on bugsummary_tag
(cost=0.00..2774.62 rows=162711 width=0) (actual time=35.177..35.177
rows=162711 loops=1)
                           Index Cond: (tag IS NULL)
               ->  Index Scan using teamparticipation_team_key on
teamparticipation  (cost=0.00..2.86 rows=1 width=8) (actual
time=0.009..0.012 rows=3 loops=6803)
                     Index Cond: (bugsummary.viewedby = teamparticipation.team)
 Total runtime: 210.336 ms
(14 rows)

Time: 281.852 ms

open bugs(anonymous user)
 explain analyze select sum(count), status from bugsummary where
distribution=1 and (viewedby is NULL )
and tag is NULL and status in (10, 15, 20, 21, 22, 25)
and sourcepackagename is NULL
group by status
order by status;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8868.95..8868.95 rows=1 width=8) (actual
time=101.659..101.662 rows=6 loops=1)
   Sort Key: status
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=8868.92..8868.94 rows=1 width=8) (actual
time=101.634..101.639 rows=6 loops=1)
         ->  Bitmap Heap Scan on bugsummary  (cost=2776.57..8829.79
rows=7826 width=8) (actual time=91.224..101.527 rows=110 loops=1)
               Recheck Cond: (tag IS NULL)
               Filter: ((viewedby IS NULL) AND (sourcepackagename IS
NULL) AND (distribution = 1) AND (status = ANY
('{10,15,20,21,22,25}'::integer[])))
               ->  Bitmap Index Scan on bugsummary_tag
(cost=0.00..2774.62 rows=162711 width=0) (actual time=28.688..28.688
rows=162711 loops=1)
                     Index Cond: (tag IS NULL)
 Total runtime: 101.770 ms
(10 rows)

Time: 103.844 ms



Follow ups