← Back to team overview

launchpad-dev team mailing list archive

status of search performance - why do we use | not & ?

 

I've been looking into our search performance, particularly on our
duplicate finding page.

The basic problem appears to be selectivity (now that the
table-scanning front-end code is eliminated).

We're doing an | operation to combine search terms.

That means, than when a user searches for 'installing eclipse I get an
unmet dependencies error', we translate it into:
SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND
BugTask.distribution = 1 AND Bug.fti @@
ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE
OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription,
TeamParticipation WHERE TeamParticipation.person = 2 AND
BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti,
ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40
OFFSET 0;

This evaluates 200000 rows to sort and then discards them:
 count
--------
 216995
(1 row)
Time: 4897.453 ms

This 200000 result set is arguably fallout from removing the
table-wide scanning front end code, but it is still less work overall:
and its exposed the deeper problem of using |.

Consider this instead:
SELECT count(*) FROM Bug, BugTask ,
ftq('(depend&eclips&error&get&instal&unmet)|(error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)')
as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
TeamParticipation.person = 2 AND BugSubscription.person =
TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1);
 count
-------
   260
(1 row)
Time: 607.873 ms

This hand crafted query looks for 'any four of the terms'  When ranked
and limited its:
(40 rows)
Time: 564.161 ms

Which is tolerable.

Alternatively, we can just search for all the terms supplied:

SELECT count(*) FROM Bug, BugTask ,
ftq('(depend&eclips&error&get&instal&unmet)') as query WHERE Bug.id =
BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
(Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
AND BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1);
 count
-------
     0
(1 row)
Time: 62.975 ms

This is perhaps going to show 'nothing visible' when really its a near miss.

So, I'm thinking that perhaps a better way - and this is a band aid,
until we fully review search - is to progressively return less and
less refined searches until we have our N(defaults to 10 )or so
similar bugs.

That is (psuedocode):
for bug in find(depend&eclips&error&get&instal&unmet): yield bug
for bug in find((error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)):
yield bug
for bug in find ....

We should be able to represent this in sql with some effort - a first
approximate would be
SELECT 1, bugtask.id FROM Bug, BugTask ,
ftq('(depend&eclips&error&get&instal&unmet)') as query WHERE Bug.id =
BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
(Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
AND BugSubscription.person = TeamParticipation.team AND
BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id
FROM Bug, BugTask ,
ftq('(depend&eclips&error&get&instal&unmet)|(error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)')
as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
TeamParticipation.person = 2 AND BugSubscription.person =
TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1)
order by 1 LIMIT 40 OFFSET 0;

Where each successive expansion gets a higher ordinal - some brief
experiments suggest this doesn't short circuit though - even though
its satisfied early on it keeps processing just incase one of the
other statements in the union has a different first column.

The theory being:
 - highly refined searches are very fast
 - we'll get reasonably relevant (most terms found - sadly not tf-idf,
but I have ideas for that) first
 - we can stop far short of provoking 14-30 second searches on prod

There are, fortunately, only a few users of nl_search to modify in this way.

Stuart, if you could comment on my reasoning here, and on whether
there is a good way to represent this directly in SQL (so we can pass
around expression objects rather than something more complex), that
would be wonderful.

Of course, the API wanting to count(*) will shoot us in the foot
still, but we can fix API search once the web UI is fixed.

-Rob



Follow ups