← Back to team overview

launchpad-dev team mailing list archive

RFC: Should we keep the logical operators "&|!-" as full text search operators?

 

I am working currently on
https://bugs.launchpad.net/launchpad/+bug/1020443 (Full text search
broken for certain search terms having a "bad combination" of
punctuation characters, like "?!.". This a fallout bug from my previous
work on another full text search related bug:
https://bugs.launchpad.net/launchpad/+bug/29713 )

As explained by stub in a comment, the stored procedure ftq() does no
longer

I see two options to fix this bug:

(A) We can either fix the immediate problem (the fix would be quite
simple) and keep the feature "treat the characters '&|!' as logical
operators in full text searches".

(B) Let ftq() simply remove "&|!" from queries.

I prefer the option (B): Launchpad is about software development, so one
option to find a bug is to paste a source code line as a search query.
This can lead to odd results:

- a search for "#!/bin/sh" will find texts that _do not_ contain the
word "/bin/sh" because '!' is the logical "not" operator.

- a search for a source code line like

  foo = bar & !(baz | blah)

will not find a text containing the very same line: The full text index
contains the four words "foo", "bar", "baz", "blah", but the query is
"find texts containing 'foo' and 'bar' but not containing 'baz' or 'blah'...

Note that we already support the alternative search operators "AND",
"OR", "NOT". (Only the upper case words are treated as logical search
operators.) Here we don't have that many conflicts: These words are stop
words and thus not indexed, hence treating them in a special way does
not affect most queries -- with one likely exception: SQL queries pasted
as a a full text search query. But this "SQL search problem" is not new...

A related problem: If a word has a '-' as a prefix, the '-' is also
treated as a NOT operator. That does not cause any harm for words, but
breaks for numbers. The Postgres full text indexer parses numbers
different from words:

select to_tsvector('abc -def');
   to_tsvector
-----------------
 'abc':1 'def':2

select to_tsvector('123 -456');
   to_tsvector
------------------
 '-456':2 '123':1

so a search for "-456" won't find a text containing "123 -456" because
the query "-456" is converted into 'NOT 456'...

Hence I think we should no longer treat a leading '-' like a NOT.

cheers
Abel


Follow ups