launchpad-dev team mailing list archive
-
launchpad-dev team
-
Mailing list archive
-
Message #06879
decoupling bugmessage and message in the schema
So, I've been analyzing a bunch of bug performance, and the
Bug->BugMessage->Message split is a significant burden.
We end up doing a 4 table *minimum* query to answer things like 'what
open bugs has user X commented on'.
This is a problem because:
- the openness (or not) is contained in bugtask
- the message commenting status is contained in message
- the bugmessage link is separate - but we need that.
Take pitti for example: 75000 messages
33K bugmessage links
Just enumerating those two fields takes 4.5 seconds:
select count(bugmessage.id) from message,bugmessage where owner=100
and bugmessage.message=message.id;
count
-------
33149
(1 row)
Time: 4531.640 ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=256803.87..256803.88 rows=1 width=4) (actual
time=5472.919..5472.920 rows=1 loops=1)
-> Hash Join (cost=142973.14..256615.52 rows=75337 width=4)
(actual time=116.964..5465.813 rows=33149 loops=1)
Hash Cond: (bugmessage.message = message.id)
-> Seq Scan on bugmessage (cost=0.00..61445.48 rows=3741348
width=8) (actual time=0.010..1088.603 rows=3737977 loops=1)
-> Hash (cost=142031.96..142031.96 rows=75294 width=4)
(actual time=116.874..116.874 rows=75175 loops=1)
-> Bitmap Heap Scan on message
(cost=1205.51..142031.96 rows=75294 width=4) (actual
time=20.798..87.278 rows=75175 loops=1)
Recheck Cond: (owner = 100)
-> Bitmap Index Scan on message_owner_idx
(cost=0.00..1186.69 rows=75294 width=0) (actual time=14.215..14.215
rows=75175 loops=1)
Index Cond: (owner = 100)
Total runtime: 5474.103 ms
As you can see here:
Hash Cond: (bugmessage.message = message.id)
-> Seq Scan on bugmessage
(cost=0.00..70798.85 rows=3027910 width=8) (actual
time=0.021..1354.923 rows=3026226 loops=1)
Filter: (index > 0)
-> Hash (cost=142031.96..142031.96
rows=75294 width=4) (actual time=122.059..122.059 rows=75175 loops=1)
-> Bitmap Heap Scan on message
(cost=1205.51..142031.96 rows=75294 width=4) (actual
time=22.913..92.674 rows=75175 loops=1)
Recheck Cond: (owner = 100)
this is precisely the subplan that turns up in the page for 'bugs this
user has commented on' for high volume users.
Now, I may be wrong, but I think its worth duplicating message rather
than sharing them between bugs and questions and merge proposals: we
have /very/ few shared messages today (640 bug messages have shared
ids). This doesn't need to mean duplicate model code, but we may need
to do something a little fancy to reuse model code with different db
tables.
Having a single table with the abillty to answer all three constraints
(bug, message owner, message index) at once would let postgresql do a
bitmap join between the bugtask.bug and bugmessage.bug fields (given
an appropriate index on bugmessage - and then bring in the bug itself
for privacy and product visibility.
We can move incrementally to this - for instance, we can start with
denormalised trigger-maintained fields on bugmessage, and if we can
get great queries out of that, do the larger shuffle.
Are there any reasons we /shouldn't/ do this?
Relatedly, BuildFarmJob and its related tables suffers nearly
identical poor or pathological query behaviour, but one thing at a
time.
-Rob
Follow ups