← Back to team overview

launchpad-dev team mailing list archive

Re: status of search performance - why do we use | not & ?

 

Sort of along these lines, when doing advanced search if you specify a
set of tags to search on, by default it selects combinator 'Any', where
as if it defaulted to 'All' it would return a smaller set.

In addition, as a user that comes closer to my expectations of what the
default behavior would be.

On Sat, Jul 24, 2010 at 09:39:24AM +0200, Robert Collins wrote:
> I've been looking into our search performance, particularly on our
> duplicate finding page.
> 
> The basic problem appears to be selectivity (now that the
> table-scanning front-end code is eliminated).
> 
> We're doing an | operation to combine search terms.
> 
> That means, than when a user searches for 'installing eclipse I get an
> unmet dependencies error', we translate it into:
> SELECT some-stuff FROM Bug, BugTask WHERE Bug.id = BugTask.bug AND
> BugTask.distribution = 1 AND Bug.fti @@
> ftq('depend|eclips|error|get|instal|unmet') AND (Bug.private = FALSE
> OR EXISTS ( SELECT BugSubscription.bug FROM BugSubscription,
> TeamParticipation WHERE TeamParticipation.person = 2 AND
> BugSubscription.person = TeamParticipation.team AND
> BugSubscription.bug = Bug.id)) AND (1=1) ORDER BY -rank(bug.fti,
> ftq('depend|eclips|error|get|instal|unmet')), bugtask.id LIMIT 40
> OFFSET 0;
> 
> This evaluates 200000 rows to sort and then discards them:
>  count
> --------
>  216995
> (1 row)
> Time: 4897.453 ms
> 
> This 200000 result set is arguably fallout from removing the
> table-wide scanning front end code, but it is still less work overall:
> and its exposed the deeper problem of using |.
> 
> Consider this instead:
> SELECT count(*) FROM Bug, BugTask ,
> ftq('(depend&eclips&error&get&instal&unmet)|(error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)')
> as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
> Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
> BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
> TeamParticipation.person = 2 AND BugSubscription.person =
> TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1);
>  count
> -------
>    260
> (1 row)
> Time: 607.873 ms
> 
> This hand crafted query looks for 'any four of the terms'  When ranked
> and limited its:
> (40 rows)
> Time: 564.161 ms
> 
> Which is tolerable.
> 
> Alternatively, we can just search for all the terms supplied:
> 
> SELECT count(*) FROM Bug, BugTask ,
> ftq('(depend&eclips&error&get&instal&unmet)') as query WHERE Bug.id =
> BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
> (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
> BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
> AND BugSubscription.person = TeamParticipation.team AND
> BugSubscription.bug = Bug.id)) AND (1=1);
>  count
> -------
>      0
> (1 row)
> Time: 62.975 ms
> 
> This is perhaps going to show 'nothing visible' when really its a near miss.
> 
> So, I'm thinking that perhaps a better way - and this is a band aid,
> until we fully review search - is to progressively return less and
> less refined searches until we have our N(defaults to 10 )or so
> similar bugs.
> 
> That is (psuedocode):
> for bug in find(depend&eclips&error&get&instal&unmet): yield bug
> for bug in find((error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)):
> yield bug
> for bug in find ....
> 
> We should be able to represent this in sql with some effort - a first
> approximate would be
> SELECT 1, bugtask.id FROM Bug, BugTask ,
> ftq('(depend&eclips&error&get&instal&unmet)') as query WHERE Bug.id =
> BugTask.bug AND BugTask.distribution = 1 AND Bug.fti @@ query AND
> (Bug.private = FALSE OR EXISTS ( SELECT BugSubscription.bug FROM
> BugSubscription, TeamParticipation WHERE TeamParticipation.person = 2
> AND BugSubscription.person = TeamParticipation.team AND
> BugSubscription.bug = Bug.id)) AND (1=1) union SELECT 2, bugtask.id
> FROM Bug, BugTask ,
> ftq('(depend&eclips&error&get&instal&unmet)|(error&get&instal&unmet)|(depend&eclips&get&instal&unmet)|(depend&eclips&error&instal&unmet)|(depend&eclips&error&get&unmet)|(depend&eclips&error&get&instal)')
> as query WHERE Bug.id = BugTask.bug AND BugTask.distribution = 1 AND
> Bug.fti @@ query AND (Bug.private = FALSE OR EXISTS ( SELECT
> BugSubscription.bug FROM BugSubscription, TeamParticipation WHERE
> TeamParticipation.person = 2 AND BugSubscription.person =
> TeamParticipation.team AND BugSubscription.bug = Bug.id)) AND (1=1)
> order by 1 LIMIT 40 OFFSET 0;
> 
> Where each successive expansion gets a higher ordinal - some brief
> experiments suggest this doesn't short circuit though - even though
> its satisfied early on it keeps processing just incase one of the
> other statements in the union has a different first column.
> 
> The theory being:
>  - highly refined searches are very fast
>  - we'll get reasonably relevant (most terms found - sadly not tf-idf,
> but I have ideas for that) first
>  - we can stop far short of provoking 14-30 second searches on prod
> 
> There are, fortunately, only a few users of nl_search to modify in this way.
> 
> Stuart, if you could comment on my reasoning here, and on whether
> there is a good way to represent this directly in SQL (so we can pass
> around expression objects rather than something more complex), that
> would be wonderful.
> 
> Of course, the API wanting to count(*) will shoot us in the foot
> still, but we can fix API search once the web UI is fixed.
> 
> -Rob
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp



Follow ups

References