
launchpad-dev team mailing list archive

Re: status of search performance - why do we use | not & ?

 

On Sat, Jul 24, 2010 at 2:39 PM, Robert Collins
<robert.collins@xxxxxxxxxxxxx> wrote:

> Stuart, if you could comment on my reasoning here, and on whether
> there is a good way to represent this directly in SQL (so we can pass
> around expression objects rather than something more complex), that
> would be wonderful.

I think it is a good idea to run the more specific queries first and
fall back to more expensive queries when not enough results are
returned. I don't think this can be done database side, though, unless
you use a stored procedure. You are better off issuing multiple
queries.

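def search_tasks(num_wanted=40):
    """Yield up to num_wanted tasks, trying the most specific query first."""
    task_ids = []
    for task in specific_query(limit=num_wanted):
        yield task
        task_ids.append(task.id)
    # Only fall back if the specific query came up short.
    remaining = num_wanted - len(task_ids)
    if remaining > 0:
        for task in less_specific_query(skip_ids=task_ids, limit=remaining):
            yield task
            task_ids.append(task.id)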
[...]

For further optimization, we should not do ordering on the server
side. Instead, we return task, rank(fti) and order on the client side.
This allows us to short circuit the crazy queries that return
thousands of matches. Instead of the ORDER BY clause, we add LIMIT
1001 and if we retrieve 1001 results raise a StupidSearchError and
handle it. The reason we cannot order on the server side with this
technique is that PG will need to materialize all the results to do
the ordering, which defeats the purpose of the short circuit.
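
A minimal sketch of the short circuit, assuming a hypothetical
run_query(limit=...) callable that issues the search without an ORDER BY
and returns (task, rank) tuples. The cap of 1000 and the helper name
fetch_ranked are illustrative assumptions, not existing Launchpad code.

```python
MAX_RESULTS = 1000  # assumed cap; the query asks for one extra row


class StupidSearchError(Exception):
    """Raised when a search matches too many rows to be worth ranking."""


def fetch_ranked(run_query, max_results=MAX_RESULTS):
    """Fetch unordered (task, rank) rows and order them client side.

    run_query is a hypothetical callable that executes the search with
    no ORDER BY and the given LIMIT, returning (task, rank) tuples.
    """
    rows = list(run_query(limit=max_results + 1))
    if len(rows) > max_results:
        # Receiving the extra row means the search is too broad to rank.
        raise StupidSearchError("more than %d matches" % max_results)
    # Highest rank first; sorted() is stable, so ties keep database order.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The caller catches StupidSearchError and decides how to handle it, for
example by asking the user to narrow the search.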

I notice that we are passing in stemmed terms. tsearch2 will perform
the stemming again, so we may not be searching for what we think we
are searching for.

ftq() is just a helper we wrote to construct a tsquery type. It handles
Google-style booleans (AND, OR, NOT), adds hyphenation support
('foo-bar' -> '(foobar|(foo&bar))') and fixes syntax errors (tsearch2
insists on a strict syntax and will raise an exception if passed a
query like 'foo&&bar' or 'foo(bar'). ftq() is an absolutely horrible
stored procedure embedded in the middle of fti.py, so it might be worth
moving this client side. As we are already spitting out native
tsearch2 booleans, we are better off directly invoking to_tsquery()
rather than ftq().

# select to_tsquery('foo&bar');
  to_tsquery
---------------
 'foo' & 'bar'
(1 row)
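
If the ftq() logic did move client side, it might look roughly like the
sketch below. This is a simplified, hypothetical stand-in and not a port
of the stored procedure: build_tsquery is an invented name, and it drops
parentheses and stray punctuation rather than trying to repair them.

```python
import re

# Map Google-style boolean keywords to tsquery operators.
_OPERATORS = {"AND": "&", "OR": "|", "NOT": "!"}


def build_tsquery(text):
    """Turn free-form search text into a string for to_tsquery()."""
    out = []
    pending = None   # binary operator waiting for its right-hand term
    negate = False   # True when the next term follows a NOT
    for token in re.findall(r"[\w-]+", text):
        op = _OPERATORS.get(token)
        if op == "!":
            negate = True
            continue
        if op is not None:
            # Ignore a binary operator that has no left-hand term.
            if out:
                pending = op
            continue
        words = [w for w in token.split("-") if w]
        if not words:
            continue
        if len(words) > 1:
            # 'foo-bar' matches either the joined word or both parts.
            term = "(%s|(%s))" % ("".join(words), "&".join(words))
        else:
            term = words[0]
        if negate:
            term = "!" + term
            negate = False
        if out:
            # Adjacent terms with no explicit operator are ANDed.
            out.append(pending or "&")
        out.append(term)
        pending = None
    return "".join(out)
```

With this sketch, build_tsquery('foo-bar') gives '(foobar|(foo&bar))',
and both 'foo&&bar' and 'foo(bar' come out as 'foo&bar', which
to_tsquery() accepts.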



-- 
Stuart Bishop <stuart@xxxxxxxxxxxxxxxx>
http://www.stuartbishop.net/


