launchpad-dev team mailing list archive
-
launchpad-dev team
-
Mailing list archive
-
Message #02592
Re: Sorting bugs/bugtasks by patch age
On Tue, Feb 9, 2010 at 12:07 AM, Jonathan Lange <jml@xxxxxxxxxxxxx> wrote:
On Fri, Jan 29, 2010 at 12:22 PM, Abel Deuring
<abel.deuring@xxxxxxxxxxxxx> wrote:
On 29.01.2010 10:54, Jonathan Lange wrote:
On Fri, Jan 29, 2010 at 9:53 AM, Jonathan Lange <jml@xxxxxxxxxxxxx> wrote:
On Thu, Jan 28, 2010 at 7:30 PM, Abel Deuring
<abel.deuring@xxxxxxxxxxxxx> wrote:
In other words, a deliberate denormalisation. What do you think, should
we start with a plain query without such a denormalsation, or do you
think that it is better to add the "cache column"
date_created_youngest_patch to the Bug table?
This is really Stuart's province. I don't have a strong opinion on this.
However, it seems to me that we're getting increasingly frequent
requests to denormalize data in order to make querying faster. Perhaps
we need a general answer, or some other technology.
Perhaps, but finding a general answer is not trivial, I think. "Think
harder to find a better query" is one answer, but won't give useful
results in each case; "We don't really need such a query" is another
(though Bryce had good reasons to ask for the sort order we're
discussing); "Let's accept denormalisation after thoughtful
consideration if we find no better way" seems most reasonable to me.
Regarding other technology, I'm quite curious about suggestions.
Mostly I mean a data warehouse.
In any case, this is Stuart's call. It's about the database and not
really about the nature of the product.
Bug searching and sorting has sucked for a long time. Our data model seems to represent things, but we cannot query it efficiently the ways we need to. I think this first became apparent when we added the 'private' flag to Bug several years ago.
I'm not happy about the cache columns we have added so far - its probably a mistake to mix the denormalized data with the normalized data. Its unclear and messy and can causes concurrency issues when cache data is being refreshed in bulk (not really an issue with bugs, but we see it with translations and code hosting).
I think the way forward is to define a bug search schema, which might just be a single super flat table (or perhaps two - a big one with public bug information and a small one with private ones), that supports efficient querying. The closest we have to this that I can think of is the Karma cache tables. I suspect this will also vastly simplify the hideous bug searching code as the queries that need to be generated will be much, much simpler.
The bug search schema can be updated in a number of ways - zope events, database triggers, cron jobs, messaging system. This is a secondary, separate problem.
Steps would be:
1) identify how we search, filter and sort our 500k+ bugs into reports useful to our end users.
2) define a suitable schema.
3) define database queries on the new schema to match our needs determined in step one. If the queries are not simple, go back to step 2.
4) benchmark on real data. If searches suck, go back to step 2.
5) Add code to keep the bug search schema in sync with reality with acceptable lag. Acceptable needs to be defined.
6) Land work
7+) Update bugs.launchpad.net pages to use the new search schema.
An alternative approach would be to not use the relational database for searching at all. Perhaps some other technology would be more helpful, such as a bugs search service in front of an in memory bugs database.
--
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/
Attachment:
signature.asc
Description: OpenPGP digital signature
References