← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~adeuring/launchpad/bug-1020443 into lp:launchpad

 

Abel Deuring has proposed merging lp:~adeuring/launchpad/bug-1020443 into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~adeuring/launchpad/bug-1020443/+merge/115721

This branch fixes bug 1020443: Search terms consisting of a tsquery operator
surrounded by punctuation can lead to OOPSes.

This bug is fallout from my previous work on bug 29713,
lp:~adeuring/launchpad/bug-29713-db-dev .

Background: We have a DB procedure ftq() which processes full text
query strings so that they can be passed to the Postgres procedure
to_tsquery() which in turn generates a query object that can be used
in SQL expressions like

    SELECT ... FROM bugtaskflat WHERE bugtaskflat.fti @@ query_object;

to_tsquery() supports more complex expressions, including grouping
terms with parentheses and the logical operators '&', '|' and '!',
with a simiar meaning as in C or Python.

It is obviously easy to provide syntactically incorrect search
expressions, so ftq() tries hard to fix possible errors -- but I broke
the proper treatment of "badly placed" logical operators in the branch
mentioned above.

The example query string from bug 1020443, "?!.", shows that users do
not always expect that "&", "|", "!" are treated in any special way,
so I asked in https://lists.launchpad.net/launchpad-dev/msg09536.html
if we should continue to treat "&|!" in full text searches as operators
or not. The email contains two more examples where typical source code
text pasted as a full text query leads to surprising results, due to
the interpretation of "&|!" as logical operators.

I got one "vote" from Curtis that ignoring "&|!" in search queries
makes sense and none votes againt doing this; since I think too that
using "&|!" as operators causes more confusion that being useful,
I removed this option.

The technical side: to_tsquery() interpret exactly the characters
"&|!" as logical operators; if they apper in text that is indexed,
they are treated like spaces. SO ftq() now replaces them too with
spaces.

To increase "search consistency" a bit more, I also removed the
"feature" that a '-' preceding a word is interpreted as a the "NOT"
operator because it makes it impossible to search for negative
numbers (which are stored including the '-' in the full text index).

Since ftq() is a stored procedure, I added the usual DB patch file.
The new file patch-2209-24-2.sql is an only slightly modified variant
of patch-2209-24-1.sql. The diff betwwen the two files:

+++ database/schema/patch-2209-24-2.sql 2012-07-19 11:41:02.170952338 +0200
@@ -17,11 +17,13 @@
         query = args[0].decode('utf8')
         ## plpy.debug('1 query is %s' % repr(query))
 
+        # Replace tsquery operators with ' '.
+        query = re.sub('[|&!]', ' ', query)
+
         # Normalize whitespace
         query = re.sub("(?u)\s+"," ", query)
 
-        # Convert AND, OR, NOT and - to tsearch2 punctuation
-        query = re.sub(r"(?u)(?:^|\s)-([\w\(])", r" !\1", query)
+        # Convert AND, OR, NOT to tsearch2 punctuation
         query = re.sub(r"(?u)\bAND\b", "&", query)
         query = re.sub(r"(?u)\bOR\b", "|", query)
         query = re.sub(r"(?u)\bNOT\b", " !", query)
@@ -34,9 +36,6 @@
         query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)
         ## plpy.debug('3 query is %s' % repr(query))
 
-        # Strip ! characters inside and at the end of a word
-        query = re.sub(r"(?u)(?<=\w)[\!]+", " ", query)
-
         # Now that we have handle case sensitive booleans, convert to lowercase
         query = query.lower()
 
@@ -122,4 +121,4 @@
         return query or None
         $_$;
 
-INSERT INTO LaunchpadDatabaseRevision VALUES (2209, 24, 1);
+INSERT INTO LaunchpadDatabaseRevision VALUES (2209, 24, 2);


The part "@@ -34,9 +36,6 @@" removes a substitution that is not
longer needed since any "!" symbols are already removed from the
query string.

Since this a "hot patch"
(https://dev.launchpad.net/PolicyAndProcess/DatabaseSchemaChangesProcess)
-- just the change of a procedure butno schema changes, trigger change,
index change etc --, this MP uses the devel branch as the target.


test: ./bin/test services.database -vvt textsearching.txt

= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  database/schema/patch-2209-24-2.sql
  lib/lp/services/database/doc/textsearching.txt

./lib/lp/services/database/doc/textsearching.txt
     734: want exceeds 78 characters.
make: *** [lint] Fehler 1

That's not caused by my changes -- and hard to fix...

-- 
https://code.launchpad.net/~adeuring/launchpad/bug-1020443/+merge/115721
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~adeuring/launchpad/bug-1020443 into lp:launchpad.
=== added file 'database/schema/patch-2209-24-2.sql'
--- database/schema/patch-2209-24-2.sql	1970-01-01 00:00:00 +0000
+++ database/schema/patch-2209-24-2.sql	2012-07-19 12:49:28 +0000
@@ -0,0 +1,124 @@
+-- Copyright 2012 Canonical Ltd.  This software is licensed under the
+-- GNU Affero General Public License version 3 (see the file LICENSE).
+
+SET client_min_messages=ERROR;
+
+CREATE OR REPLACE FUNCTION _ftq(text) RETURNS text
+    LANGUAGE plpythonu IMMUTABLE STRICT
+    AS $_$
+        import re
+
+        # I think this method would be more robust if we used a real
+        # tokenizer and parser to generate the query string, but we need
+        # something suitable for use as a stored procedure which currently
+        # means no external dependancies.
+
+        # Convert to Unicode
+        query = args[0].decode('utf8')
+        ## plpy.debug('1 query is %s' % repr(query))
+
+        # Replace tsquery operators with ' '.
+        query = re.sub('[|&!]', ' ', query)
+
+        # Normalize whitespace
+        query = re.sub("(?u)\s+"," ", query)
+
+        # Convert AND, OR, NOT to tsearch2 punctuation
+        query = re.sub(r"(?u)\bAND\b", "&", query)
+        query = re.sub(r"(?u)\bOR\b", "|", query)
+        query = re.sub(r"(?u)\bNOT\b", " !", query)
+        ## plpy.debug('2 query is %s' % repr(query))
+
+        # Deal with unwanted punctuation.
+        # ':' is used in queries to specify a weight of a word.
+        # '\' is treated differently in to_tsvector() and to_tsquery().
+        punctuation = r'[:\\]'
+        query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)
+        ## plpy.debug('3 query is %s' % repr(query))
+
+        # Now that we have handle case sensitive booleans, convert to lowercase
+        query = query.lower()
+
+        # Remove unpartnered bracket on the left and right
+        query = re.sub(r"(?ux) ^ ( [^(]* ) \)", r"(\1)", query)
+        query = re.sub(r"(?ux) \( ( [^)]* ) $", r"(\1)", query)
+
+        # Remove spurious brackets
+        query = re.sub(r"(?u)\(([^\&\|]*?)\)", r" \1 ", query)
+        ## plpy.debug('5 query is %s' % repr(query))
+
+        # Insert & between tokens without an existing boolean operator
+        # ( not proceeded by (|&!
+        query = re.sub(r"(?u)(?<![\(\|\&\!])\s*\(", "&(", query)
+        ## plpy.debug('6 query is %s' % repr(query))
+        # ) not followed by )|&
+        query = re.sub(r"(?u)\)(?!\s*(\)|\||\&|\s*$))", ")&", query)
+        ## plpy.debug('6.1 query is %s' % repr(query))
+        # Whitespace not proceded by (|&! not followed by &|
+        query = re.sub(r"(?u)(?<![\(\|\&\!\s])\s+(?![\&\|\s])", "&", query)
+        ## plpy.debug('7 query is %s' % repr(query))
+
+        # Detect and repair syntax errors - we are lenient because
+        # this input is generally from users.
+
+        # Fix unbalanced brackets
+        openings = query.count("(")
+        closings = query.count(")")
+        if openings > closings:
+            query = query + " ) "*(openings-closings)
+        elif closings > openings:
+            query = " ( "*(closings-openings) + query
+        ## plpy.debug('8 query is %s' % repr(query))
+
+        # Strip ' character that do not have letters on both sides
+        query = re.sub(r"(?u)((?<!\w)'|'(?!\w))", "", query)
+
+        # Brackets containing nothing but whitespace and booleans, recursive
+        last = ""
+        while last != query:
+            last = query
+            query = re.sub(r"(?u)\([\s\&\|\!]*\)", "", query)
+        ## plpy.debug('9 query is %s' % repr(query))
+
+        # An & or | following a (
+        query = re.sub(r"(?u)(?<=\()[\&\|\s]+", "", query)
+        ## plpy.debug('10 query is %s' % repr(query))
+
+        # An &, | or ! immediatly before a )
+        query = re.sub(r"(?u)[\&\|\!\s]*[\&\|\!]+\s*(?=\))", "", query)
+        ## plpy.debug('11 query is %s' % repr(query))
+
+        # An &,| or ! followed by another boolean.
+        query = re.sub(r"(?ux) \s* ( [\&\|\!] ) [\s\&\|]+", r"\1", query)
+        ## plpy.debug('12 query is %s' % repr(query))
+
+        # Leading & or |
+        query = re.sub(r"(?u)^[\s\&\|]+", "", query)
+        ## plpy.debug('13 query is %s' % repr(query))
+
+        # Trailing &, | or !
+        query = re.sub(r"(?u)[\&\|\!\s]+$", "", query)
+        ## plpy.debug('14 query is %s' % repr(query))
+
+        # If we have nothing but whitespace and tsearch2 operators,
+        # return NULL.
+        if re.search(r"(?u)^[\&\|\!\s\(\)]*$", query) is not None:
+            return None
+
+        # Convert back to UTF-8
+        query = query.encode('utf8')
+        ## plpy.debug('15 query is %s' % repr(query))
+
+        return query or None
+        $_$;
+
+CREATE OR REPLACE FUNCTION ftq(text) RETURNS pg_catalog.tsquery
+    LANGUAGE plpythonu IMMUTABLE STRICT
+    AS $_$
+        p = plpy.prepare(
+            "SELECT to_tsquery('default', _ftq($1)) AS x", ["text"])
+        query = plpy.execute(p, args, 1)[0]["x"]
+        return query or None
+        $_$;
+
+INSERT INTO LaunchpadDatabaseRevision VALUES (2209, 24, 2);

=== renamed file 'database/schema/patch-2209-24-2.sql' => 'database/schema/patch-2209-24-2.sql.moved'
=== modified file 'lib/lp/services/database/doc/textsearching.txt'
--- lib/lp/services/database/doc/textsearching.txt	2012-06-26 09:40:38 +0000
+++ lib/lp/services/database/doc/textsearching.txt	2012-07-19 12:49:28 +0000
@@ -172,23 +172,16 @@
     >>> ftq('hi AND mom')
     hi&mom <=> 'hi' & 'mom'
 
-    >>> ftq('hi & mom')
-    hi&mom <=> 'hi' & 'mom'
-
     >>> ftq('hi OR mom')
     hi|mom <=> 'hi' | 'mom'
 
-    >>> ftq('hi | mom')
-    hi|mom <=> 'hi' | 'mom'
-
-    >>> ftq('hi & -dad')
+    >>> ftq('hi AND NOT dad')
     hi&!dad <=> 'hi' & !'dad'
 
 
-
 Brackets are allowed to specify precidence
 
-    >>> ftq('(HI OR HELLO) & mom')
+    >>> ftq('(HI OR HELLO) AND mom')
     (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
     >>> ftq('Hi(Mom)')
@@ -203,19 +196,16 @@
     >>> ftq('foo(bar OR baz)') # Bug #32071
     foo&(bar|baz) <=> 'foo' & ( 'bar' | 'baz' )
 
-    >>> ftq('foo (bar OR baz)')
-    foo&(bar|baz) <=> 'foo' & ( 'bar' | 'baz' )
-
 
 We also support negation
 
-    >>> ftq('!Hi')
+    >>> ftq('NOT Hi')
     !hi <=> !'hi'
 
-    >>> ftq('-(Hi & Mom)')
+    >>> ftq('NOT(Hi AND Mom)')
     !(hi&mom) <=> !( 'hi' & 'mom' )
 
-    >>> ftq('Foo & ! Bar')
+    >>> ftq('Foo AND NOT Bar')
     foo&!bar <=> 'foo' & !'bar'
 
 
@@ -224,7 +214,7 @@
     >>> ftq('Hi Mom')
     hi&mom <=> 'hi' & 'mom'
 
-    >>> ftq('Hi -mom')
+    >>> ftq('Hi NOT mom')
     hi&!mom <=> 'hi' & !'mom'
 
     >>> ftq('hi (mom OR mum)')
@@ -233,18 +223,34 @@
     >>> ftq('(hi OR hello) mom')
     (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
-    >>> ftq('(hi OR hello) -mom')
+    >>> ftq('(hi OR hello) NOT mom')
     (hi|hello)&!mom <=> ( 'hi' | 'hello' ) & !'mom'
 
     >>> ftq('(hi ho OR hoe) work go')
     (hi&ho|hoe)&work&go <=> ( 'hi' & 'ho' | 'hoe' ) & 'work' & 'go'
 
 
-If a single '-' precedes a word, it is converted into the '!' operator.
-Note also that a trailing '-' is dropped by to_tsquery().
-
-    >>> ftq('-foo bar-')
-    !foo&bar- <=> !'foo' & 'bar'
+'-' symbols are treated by the Postgres FTI parser context sensitive.
+If they precede a word, they are removed.
+
+    >>> print search_same('foo -bar')
+    FTI data: 'bar':2 'foo':1
+    query: 'foo' & 'bar'
+    match: True
+
+If a '-' precedes a number, it is retained.
+
+    >>> print search_same('123 -456')
+    FTI data: '-456':2 '123':1
+    query: '123' & '-456'
+    match: True
+
+Trailing '-' are always ignored.
+
+    >>> print search_same('bar- 123-')
+    FTI data: '123':2 'bar':1
+    query: 'bar' & '123'
+    match: True
 
 Repeated '-' are simply ignored by to_tsquery().
 
@@ -259,6 +265,12 @@
     query: 'foo-bar' & 'foo' & 'bar'
     match: True
 
+A '-' surrounded by numbers is treated as the sign of the right-hand number.
+
+    >>> print search_same('123-456')
+    FTI data: '-456':2 '123':1
+    query: '123' & '-456'
+    match: True
 
 Punctuation is handled consistently. If a string containing punctuation
 appears in an FTI, it can also be passed to ftq(),and a search for this
@@ -342,11 +354,36 @@
     >>> print search('some text <div>whatever</div>', 'div')
     FTI data: 'text':2 'whatev':3 query: 'div' match: False
 
-Treatment of characters that are used as operators in to_tsquery():
+The symbols '&', '|' and '!' are treated as operators by to_tsquery();
+to_tsvector() treats them as whitespace. ftq() converts the words 'AND',
+'OR', 'NOT' are into these operators expected by to_tsquery(), and it
+replaces the symbols '&', '|' and '!' with spaces. This avoids
+surprising search results when the operator symbols appear accidentally
+in search terms, e.g., by using a plain copy of a source code line as
+the search term.
 
     >>> ftq('cool!')
     cool <=> 'cool'
 
+    >>> print search_same('Shell scripts usually start with #!/bin/sh.')
+    FTI data: '/bin/sh':6 'script':2 'shell':1 'start':4 'usual':3
+    query: 'shell' & 'script' & 'usual' & 'start' & '/bin/sh'
+    match: True
+
+    >>> print search_same('int foo = (bar & ! baz) | bla;')
+    FTI data: 'bar':3 'baz':4 'bla':5 'foo':2 'int':1
+    query: 'int' & 'foo' & 'bar' & 'baz' & 'bla'
+    match: True
+
+Queries containing only punctuation symbols yield an empty ts_query
+object. Note that _ftq() first replaces the '!' with a ' '; later on,
+_ftq() joins the two remaining terms '?' and '.' with the "AND"
+operator '&'. Finally, to_tsquery() detects the AND combination of
+two symbols that are not tokenized and returns null.
+
+    >>> ftq('?!.') # Bug 1020443
+    ?&. <=> None
+
 Email addresses are retained as a whole, both by to_tsvector() and by
 ftq().
 
@@ -434,7 +471,7 @@
 Dud queries are 'repaired', such as doubled operators, trailing operators
 or invalid leading operators
 
-    >>> ftq('hi & OR mom')
+    >>> ftq('hi AND OR mom')
     hi&mom <=> 'hi' & 'mom'
 
     >>> ftq('(hi OR OR hello) AND mom')
@@ -443,7 +480,7 @@
     >>> ftq('(hi OR AND hello) AND mom')
     (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
-    >>> ftq('(hi OR -AND hello) AND mom')
+    >>> ftq('(hi OR NOT AND hello) AND mom')
     (hi|!hello)&mom <=> ( 'hi' | !'hello' ) & 'mom'
 
     >>> ftq('(hi OR - AND hello) AND mom')
@@ -452,13 +489,13 @@
     >>> ftq('hi AND mom AND')
     hi&mom <=> 'hi' & 'mom'
 
-    >>> ftq('& hi & mom')
+    >>> ftq('AND hi AND mom')
     hi&mom <=> 'hi' & 'mom'
 
-    >>> ftq('(& hi | hello) AND mom')
+    >>> ftq('(AND hi OR hello) AND mom')
     (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
-    >>> ftq('() hi mom ( ) ((! |((&)))) :-)')
+    >>> ftq('() hi mom ( ) ((NOT OR((AND)))) :-)')
     (hi&mom&-) <=> 'hi' & 'mom'
 
     >>> ftq("(hi mom")
@@ -502,10 +539,10 @@
 
     Bug #160236
 
-    >>> ftq("foo&&bar-baz")
+    >>> ftq("foo AND AND bar-baz")
     foo&bar-baz <=> 'foo' & 'bar-baz' & 'bar' & 'baz'
 
-    >>> ftq("foo||bar.baz")
+    >>> ftq("foo OR OR bar.baz")
     foo|bar.baz <=> 'foo' | 'bar.baz'
 
 


Follow ups