launchpad-dev team mailing list archive

Thread
Date
Re: Is it ok to report slightly inflated bug counts in portlets; counting bugs; accuracy vs speed

To: Robert Collins <robert.collins@xxxxxxxxxxxxx>
From: Martin Pool <mbp@xxxxxxxxxxxxx>
Date: Wed, 13 Apr 2011 20:12:02 +1000
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <BANLkTimUgiyxUem1B4y15ZcGuwHfZwrepw@mail.gmail.com>
Sender: martinpool@xxxxxxxxx
As a user, I think being approximate is generally speaking fine, with
a couple of caveats.

There are some counts that people want to drive to 0, such as critical
bugs, new bugs, inprogress bugs, maybe high bugs, maybe bugs assigned
to me.  If the count for them doesn't go to 0, or conversely if it
goes to 0 when some items still exist, it will be
annoying/confusing/frustrating.  I've hit this before when these
things were memcached (iirc).  Some projects/teams may use tags for
similar important workflow things.  I wouldn't notice 500 vs 700 but I
would notice 2 vs 3 critical bugs.

Secondly if it does happen that people get an accurate count through
the api or through paging through all the results, it may be
disconcerting if it doesn't match; but perhaps that can be handled by
just indicating that the numbers are approximate, or again being
accurate at low numbers.

2c,
Martin




On 13 April 2011 20:00, Robert Collins <robert.collins@xxxxxxxxxxxxx> wrote:
> https://bugs.launchpad.net/launchpad/+bug/758587 describes a challenge
> we have; counting bugs is expensive.
>
> portlets like those on
> https://bugs.staging.launchpad.net/ubuntu/+bugs?search=Search&field.status=In+Progress
> and https://bugs.staging.launchpad.net/ubuntu show various aggregates
> of bugs. They've been inaccurate (counting multiple tasks towards the
> count) in the past; I recently made them more accurate as part of
> making them faster; wgrant has just this week optimised the tags
> portlet.
>
> We're now about as fast as we can get while we do table scans of this
> data in our current architecture (*). Scanning the 700K bug and 900K
> bugtask tables together accounts for ~4 seconds of DB time once you
> factor in privacy [though there is some promise of improving on that,
> it won't be radically better].
>
> A fact table (en.wikipedia.org/wiki/Fact_table) can answer a number of
> aggregrates much more quickly, but at a cost. The cost is twofold: we
> have to maintain the table, and its losing some precision, so
> somethings become more vague and best answered from the source data.
>
> I have put together a prototype at the SQL level; it performs
> adequately: we can satisfy queries for ubuntu scale tags (queries are
> at the end of this mail for those interested - or read the gory
> details in the bug):
> top 10 tags (logged in) completes in < 0.5 seconds and considers 33K rows
> open bugs (logged in) completes in < 0.3 seconds and considers 14K rows
> top 10 tags (anonymous) completes in < 0.2 seconds and considers 10K rows
> open bugs (with privacy) completes in < 0.2 seconds and considers 110 rows
>
> But what about the costs? And thats where we come to why I'm writing this mail.
>
> Maintaining the table is easy - we can start with a background job
> every few minutes, it takes 2 minutes to update globally on staging,
> I'm sure on prod it would be even snappier. We can also update it with
> triggers or from our appserver code.
>
> The *results* of queries on this table will be inaccurate on private
> bugs though, if a user has 2 or more paths to visibility.
> Consider bug X, which I am directly subscribed to, and which
> ~launchpad is also assigned to. The schema I've come up with for this
> table will have the bug X counted twice:
>  - once with me as a viewer
>  - once with ~launchpad as a viewer
>
> but because we've discarded the bug ids that contributed to those rows
> in the summary table, we cannot tell that we shouldn't add the counts
> together, so we see 1 too many bugs when querying the table.
>
> We have a choice:
>  - make the data utterly precise
>  - be a bit simpler and accept some inaccuracy
>
> I'd like to go with simpler an inaccurate for now because:
>  - we went for years with inaccurate counts unnoticed by users [ who
> can tell that 700 should be 580, really?]
>  - the error rate is pretty low - my tests so far suggest ~5% [of
> private bugs] - so the numbers won't be *far* off.
>  - being more precise will be more complex - we'll need to query
> privacy separately
>
> So, specifically to jml & Deryck, but also to the whole team and any
> users following this discussion: is it ok to be slightly inflated but
> fast?
>
> -Rob
>
> (*) A horizontally sharded system could in principle calculate from
> raw data concurrently on 10 or 20 nodes and get the same numbers in
> 1/10th or 1/20th the time. Such system usually recommend having
> summary indices though which make that unnecessary.
>
> sample queries:
> tags (with privacy)
>  explain analyze select tag, sum(count) from bugsummary left join
> teamparticipation on viewedby=teamparticipation.team where
> distribution=1 and (viewedby is NULL or teamparticipation.person=2)
> and sourcepackagename is NULL and tag is not null  group by tag
> order by sum(count) desc
> limit 10;
>
>         QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=68724.35..68724.38 rows=10 width=14) (actual
> time=430.363..430.367 rows=10 loops=1)
>   ->  Sort  (cost=68724.35..68728.58 rows=1692 width=14) (actual
> time=430.363..430.364 rows=10 loops=1)
>         Sort Key: (sum(bugsummary.count))
>         Sort Method:  top-N heapsort  Memory: 25kB
>         ->  HashAggregate  (cost=68666.64..68687.79 rows=1692
> width=14) (actual time=427.296..429.017 rows=4460 loops=1)
>               ->  Nested Loop Left Join  (cost=0.00..66672.08
> rows=398912 width=14) (actual time=106.304..414.371 rows=10341
> loops=1)
>                     Filter: ((bugsummary.viewedby IS NULL) OR
> (teamparticipation.person = 2))
>                     ->  Seq Scan on bugsummary  (cost=0.00..7440.35
> rows=55700 width=18) (actual time=106.230..135.308 rows=33483 loops=1)
>                           Filter: ((sourcepackagename IS NULL) AND
> (tag IS NOT NULL) AND (distribution = 1))
>                     ->  Index Scan using teamparticipation_team_key
> on teamparticipation  (cost=0.00..1.05 rows=1 width=8) (actual
> time=0.003..0.006 rows=5 loops=33483)
>                           Index Cond: (bugsummary.viewedby =
> teamparticipation.team)
>  Total runtime: 430.495 ms
> (12 rows)
>
> tags (anonymous user)
> explain analyze select tag, sum(count) from bugsummary where
> distribution=1 and (viewedby is NULL) and sourcepackagename is NULL
> and tag is not null  group by tag
> order by sum(count) desc
> limit 10;
>                                                              QUERY
> PLAN
> --------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=7626.05..7626.07 rows=10 width=14) (actual
> time=191.697..191.704 rows=10 loops=1)
>   ->  Sort  (cost=7626.05..7628.39 rows=935 width=14) (actual
> time=191.695..191.699 rows=10 loops=1)
>         Sort Key: (sum(count))
>         Sort Method:  top-N heapsort  Memory: 25kB
>         ->  HashAggregate  (cost=7594.16..7605.84 rows=935 width=14)
> (actual time=185.884..189.067 rows=4446 loops=1)
>               ->  Seq Scan on bugsummary  (cost=0.00..7440.35
> rows=30761 width=14) (actual time=142.195..168.853 rows=10006 loops=1)
>                     Filter: ((viewedby IS NULL) AND
> (sourcepackagename IS NULL) AND (tag IS NOT NULL) AND (distribution =
> 1))
>  Total runtime: 191.804 ms
> (8 rows)
>
>
> open bugs (with privacy)
>  explain analyze select sum(count), status from bugsummary
> left join teamparticipation on
> bugsummary.viewedby=teamparticipation.team where
> distribution=1 and (viewedby is NULL or teamparticipation.person=2)
> and tag is NULL and status in (10, 15, 20, 21, 22, 25)
> and sourcepackagename is NULL
> group by status
> order by status;
>
>     QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=50096.87..50096.88 rows=1 width=8) (actual
> time=210.216..210.217 rows=6 loops=1)
>   Sort Key: bugsummary.status
>   Sort Method:  quicksort  Memory: 25kB
>   ->  HashAggregate  (cost=50096.85..50096.86 rows=1 width=8) (actual
> time=210.195..210.198 rows=6 loops=1)
>         ->  Nested Loop Left Join  (cost=2778.16..49589.40
> rows=101490 width=8) (actual time=95.659..209.967 rows=133 loops=1)
>               Filter: ((bugsummary.viewedby IS NULL) OR
> (teamparticipation.person = 2))
>               ->  Bitmap Heap Scan on bugsummary
> (cost=2778.16..8831.38 rows=14171 width=12) (actual
> time=95.608..118.717 rows=6803 loops=1)
>                     Recheck Cond: (tag IS NULL)
>                     Filter: ((sourcepackagename IS NULL) AND
> (distribution = 1) AND (status = ANY
> ('{10,15,20,21,22,25}'::integer[])))
>                     ->  Bitmap Index Scan on bugsummary_tag
> (cost=0.00..2774.62 rows=162711 width=0) (actual time=35.177..35.177
> rows=162711 loops=1)
>                           Index Cond: (tag IS NULL)
>               ->  Index Scan using teamparticipation_team_key on
> teamparticipation  (cost=0.00..2.86 rows=1 width=8) (actual
> time=0.009..0.012 rows=3 loops=6803)
>                     Index Cond: (bugsummary.viewedby = teamparticipation.team)
>  Total runtime: 210.336 ms
> (14 rows)
>
> Time: 281.852 ms
>
> open bugs(anonymous user)
>  explain analyze select sum(count), status from bugsummary where
> distribution=1 and (viewedby is NULL )
> and tag is NULL and status in (10, 15, 20, 21, 22, 25)
> and sourcepackagename is NULL
> group by status
> order by status;
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=8868.95..8868.95 rows=1 width=8) (actual
> time=101.659..101.662 rows=6 loops=1)
>   Sort Key: status
>   Sort Method:  quicksort  Memory: 25kB
>   ->  HashAggregate  (cost=8868.92..8868.94 rows=1 width=8) (actual
> time=101.634..101.639 rows=6 loops=1)
>         ->  Bitmap Heap Scan on bugsummary  (cost=2776.57..8829.79
> rows=7826 width=8) (actual time=91.224..101.527 rows=110 loops=1)
>               Recheck Cond: (tag IS NULL)
>               Filter: ((viewedby IS NULL) AND (sourcepackagename IS
> NULL) AND (distribution = 1) AND (status = ANY
> ('{10,15,20,21,22,25}'::integer[])))
>               ->  Bitmap Index Scan on bugsummary_tag
> (cost=0.00..2774.62 rows=162711 width=0) (actual time=28.688..28.688
> rows=162711 loops=1)
>                     Index Cond: (tag IS NULL)
>  Total runtime: 101.770 ms
> (10 rows)
>
> Time: 103.844 ms
>
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp
>
>
Follow ups

Re: Is it ok to report slightly inflated bug counts in portlets; counting bugs; accuracy vs speed
From: Robert Collins, 2011-04-13
References

Is it ok to report slightly inflated bug counts in portlets; counting bugs; accuracy vs speed
From: Robert Collins, 2011-04-13