[Date Prev][Date Next][Thread Prev][Thread Next][Date Index ][Thread Index ]

RE: LP bug list sorting issue

To: "'Chris Cheney'" <ccheney@xxxxxxxxxx>, "'Ubuntu distro team'" <distro-team@xxxxxxxxxxxxxxxxxxx>, <launchpad-users@xxxxxxxxxxxxxxxxxxx>
Subject: RE: LP bug list sorting issue
From: "Philippe Verdy" <verdy_p@xxxxxxxxxx>
Date: Mon, 5 Nov 2007 19:03:02 +0100
In-reply-to: <47235FB2.8090103@xxxxxxxxxx>
List-archive: <https://lists.ubuntu.com/archives/launchpad-users>
List-help: <mailto:launchpad-users-request@lists.canonical.com?subject=help>
List-id: Discussion for Launchpad users <launchpad-users.lists.canonical.com>
List-post: <mailto:launchpad-users@lists.canonical.com>
List-subscribe: <https://lists.ubuntu.com/mailman/listinfo/launchpad-users>, <mailto:launchpad-users-request@lists.canonical.com?subject=subscribe>
List-unsubscribe: <https://lists.ubuntu.com/mailman/listinfo/launchpad-users>, <mailto:launchpad-users-request@lists.canonical.com?subject=unsubscribe>
Organization: Ordinateur Personnel
References: <47235FB2.8090103@xxxxxxxxxx>
Reply-to: verdy_p@xxxxxxxxxx
Thread-index: Acge+ghNmlkz1jEiRZuQep1P2ZShcgA1Aq1A

De la part de Chris Cheney wrote:
> Envoyé : samedi 27 octobre 2007 17:57
> À : Ubuntu distro team; launchpad-users@xxxxxxxxxxxxxxxxxxx
> Objet : LP bug list sorting issue
> 
> I noticed recently that when sorted by oldest first on bugs on Launchpad
> it does more and causes the list of bugs to differ from a default
> unsorted list. I don't know if this should be considered a bug or not so
> I am posting here to see what others think.
> 
> Selecting 'oldest first' changes the url to include:
> 
> field.searchtext=&
> orderby=datecreated&
> search=Search&
> field.status%3Alist=NEW&
> field.status%3Alist=INCOMPLETE_WITH_RESPONSE&
> field.status%3Alist=CONFIRMED&
> field.status%3Alist=TRIAGED&
> field.status%3Alist=INPROGRESS&
> field.status%3Alist=FIXCOMMITTED&
> field.assignee=&
> field.bug_reporter=&
> field.omit_dupes=on&
> field.has_patch=&
> field.has_no_package=
> 
> It would seem to be logical to only add:
> 
> orderby=datecreated
> 
> Which does show the same number of bugs.
> 
> For OpenOffice as an example when I ran this test it showed 603 bugs
> before sorting and 574 after, but 603 when only adding
> orderby=datecreated to the url.

The extra empty parameters should have no impact on the search, they should
be silently discarded from the filters used in the SQL query by the server.

On the opposite, this bug shows that the "field.status" field may have other
values than those listed, and this is a problem whose origin can only be a
defect in the data consistency of the database, where unexpected
"field.status" values are also present (for example "status='new'" instead
of "status='NEW'", if incorrect capitalization is present, or "status IS
NULL" if a value is missing for the field in the database).

If the data-model requires that only those listed values:
> field.status%3Alist=NEW&
> field.status%3Alist=INCOMPLETE_WITH_RESPONSE&
> field.status%3Alist=CONFIRMED&
> field.status%3Alist=TRIAGED&
> field.status%3Alist=INPROGRESS&
> field.status%3Alist=FIXCOMMITTED&
should be present, then a database check should reveal those entries and
these entries should be fixed in the database.

Now, the server should check the list of values given in the HTTP query
before creating the SQL query:
* if all the expected values are present in the list, then drop them all
from the filter, to speed up the SQL query (less charge for the SQL server).
* if the number of unlisted values is lower than the number of listed values
in the HTTP query, then create a SQL query by using a series of
"status!='value'" connected by the SQL "AND" operator: the AND restriction
in a SQL query is MUCH faster than the OR (that generates a union of several
queries and requires creating a costly temporary hashmap in the server)
* use the "status='value'" form in SQL, ***only if*** the HTTP query lists a
single value. Otherwise, use "status!='value'" connected with the SQL AND
operator in the SQL query, using the complemented set (computed from the set
of possible values): if the complemented set is empty, you can drop the
filter.

For example, with the HTTP query above, there are more than one value listed
for "field.status", so the server will complement the listed values: it
should first create a set containing all the possible expected values, and
removes from this set each of the non-empty values listed in the HTTP query.
During this step, the web server script will increment a counter for the
number of values subtracted from the set.

At end of this step, it remains in the set of allowed values only those
values that are NOT wanted by the user:

* if no value has been removed from the set, i.e. the counter=0 (all values
are present in the remaining set), then return nothing to the user (none of
the listed non-empty values in the HTTP query exist in the database). Don't
need to perform a SQL query (saves resources on the SQL server for random
queries).

* if only one value has been subtracted, i.e. the counter=1, then use that
value to create a SQL query using the simple SQL restriction
"status='value'. This is efficient as it can use an index for that value to
perform indexed lookup in a very large table of bugs.

* if all values have been subtracted, i.e. the counter is maximum or the
remaining set is empty, then do not add any restriction in the SQL query;
This is efficient because the SQL query will not test this field.

* otherwise, ***DO NOT*** generate a SQL query using "OR"!!! This is costly
on the SQL server that will need to perform several queries and will need to
generate a temporary hashmap for computing the union set of results.
Instead, use each of the value still present in the remaining set to create
a filter using multiple "status!='value' connected by AND (the AND connector
does not require the creation of a temporary hashmap, it is matched by
performing a SINGLE indexed scan for the rows found by other "=" filters,
and checking if the field value is different from the value indicated in the
filter; no UNION set is needed.

In fact, this later strategy could be enhanced, if the webserver knows some
statistics about the number of rows matching each possible value. This can
be efficiently realized by indexing the table with a non-unique index on the
field, so that the server can perform very fast queries like:

* "SELECT DISTINCT status FROM bugtable"
  (to get the complete list of possible status values).
   The server can cache this list and refresh it from time to time.
* for each returned value from this list:
  "SELECT COUNT(*) FROM bugtable WHERE status='value'"
  The webserver can maintain this in a statistics cache, and refresh it
  from time to time.
* The queries above to collect these statistics can be combined in a
  single query:
  "SELECT status, COUNT(*) FROM bugtable GROUP BY status"

The statistics collected can be used to see it if is more efficient to
perform queries filtered by multiple "OR bugtable.status='value'" (using the
values from the ***complement*** of the remainining set computed above) : it
will be efficient only if the SUM of the statistics for each candidate value
is small (below 200?), and the number of generated OR clauses is below 4. If
this sum is above this threshold or if there are more than 4 allowed values
in the query, generate a filter using multiple "AND
bugtable.status!='value'" (using the values from the remaining set computed
above).

Note that some SQL servers (Oracle, Sybase, MS SQL, Informix, most of them
commercial...) can automatically transform queries this way, using their own
statistics collectors based on existing field indexes (no statistic cache is
needed if the field index is created with the UNIQUE constraint, because the
count of rows with each value is implicitly 1).

However, simple SQL servers like MySQL do not maintain statistics on
indexes, so it's up to the application (like the Launchpad server) to
collect and cache these statistics and generate the appropriate queries...

Collecting and maintaining the statistics is not needed if they are enforced
by the data-model. For example the first statistics query listed above:
  "SELECT DISTINCT status FROM bugtable"
is not needed if the data-model is correctly enforced by the application, so
that no row will be inserted or modified in the bugtable with an unexpected
value. If the "bugtable" however is indexed on status, the application
should use it, and should be able to collect these statistics dynamically.

References:
- LP bug list sorting issue
  - From: Chris Cheney

Prev by Date: RE: Additional Font Support necessary for Indian language "Malayalam"
Next by Date: fresh language-packs for gutsy
Previous by thread: LP bug list sorting issue
Next by thread: Re: LP bug list sorting issue
Index(es):
- Date
- Thread

This is the launchpad-users mailing list archive — see also the general help for Launchpad.net mailing lists.

(Formatted by MHonArc.)