
launchpad-reviewers team mailing list archive

[Merge] lp:~adeuring/launchpad/bug-29713 into lp:launchpad

 

Abel Deuring has proposed merging lp:~adeuring/launchpad/bug-29713 into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~adeuring/launchpad/bug-29713/+merge/111212

This branch fixes many of the issues described in bug 29713:
bug search fails to find results when punctuation is adjacent to
regular text in the document (e.g. '"from"', '<div>')

The two remaining issues are described in bug 1015511 and bug 1015519.

My main guideline for this branch was that any text fragment copied
from a full-text-indexed text and used as a search string should return
the text it was copied from. (Obvious constraints: the words should
not be stop words, and only whole words should be copied.)

We have two DB procedures, ftq() and _ftq(), which are used to build
full-text search queries similar to:

     SELECT ... FROM bugtaskflat
       WHERE bugtaskflat.fti @@ ftq('search text provided by LP user');

ftq() prepares the string given by a user so that it can be passed to
the procedure to_tsquery(). _ftq() is a debugging variant: while
ftq() returns the result of calling to_tsquery(processed_query_string),
_ftq() returns just processed_query_string.

The problems described in bugs 1015511 and 1015519 aside, the main
issues were

  (1) an "overly eager" replacement of punctuation characters with "-"
  (2) a replacement like

      aaa-bbb -> (aaabbb | (aaa & bbb))

1. Hyphenation handling: The old code

        # Convert foo-bar to ((foo&bar)|foobar) and foo-bar-baz to
        # ((foo&bar&baz)|foobarbaz)
        def hyphen_repl(match):
            bits = match.group(0).split("-")
            return "((%s)|%s)" % ("&".join(bits), "".join(bits))
        query = re.sub(r"(?u)\b\w+-[\w\-]+\b", hyphen_repl, query)
        ## plpy.debug('4 query is %s' % repr(query))

        # Any remaining - characters are spurious
        query = query.replace('-','')

was outdated: it converts the string 'foo-bar' into
the search expression

    '((foo&bar)|foobar)'.
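Run outside the stored procedure, the removed substitution can be
reproduced with this minimal standalone sketch of the code quoted above:

```python
import re

# Reproduction of the removed hyphen expansion from the old ftq()/_ftq().
def hyphen_repl(match):
    bits = match.group(0).split("-")
    return "((%s)|%s)" % ("&".join(bits), "".join(bits))

query = re.sub(r"(?u)\b\w+-[\w\-]+\b", hyphen_repl, "foo-bar")
# Any remaining '-' characters were considered spurious and dropped.
query = query.replace("-", "")
print(query)  # ((foo&bar)|foobar)
```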

But the FTI data stored by Postgres for the string 'foo-bar' is

    select to_tsvector('foo-bar');
             to_tsvector
    -----------------------------
     'bar':3 'foo':2 'foo-bar':1

Applying to_tsquery('((foo&bar)|foobar)') to the above FTI data would
return a match, but other manipulations by ftq() (now also
changed or removed) led to search failures for many typical filenames;
see below.

Moreover, ftq() does not need to decompose 'foo-bar' into 'foo' and 'bar'
because to_tsquery() does this itself, and in a way that matches the
data produced by to_tsvector() better:

    select to_tsquery('foo-bar');
            to_tsquery
    ---------------------------
     'foo-bar' & 'foo' & 'bar'

Finally, the old hyphen handling breaks the search for filenames containing
hyphens. I added a test for this case.

So I simply removed the code above.


2. Much punctuation was replaced with a '-'. Combined with the issue
described above, this leads to the problem from the bug report that full
text searches for file names fail.

Example:

    select _ftq('file-name.py');
                _ftq
    -----------------------------
     ((file&name&py)|filenamepy)

    select ftq('file-name.py');
                      ftq
    ---------------------------------------
     'file' & 'name' & 'py' | 'filenamepi'

while the FTI data looks like

    select to_tsvector('file-name.py');
       to_tsvector
    ------------------
     'file-name.py':1

So the FTI stores just the plain filename, nothing else, while the query
asks to look for a few slightly different terms.

On the other hand, to_tsquery() handles file names just fine:

    select to_tsquery('file-name.py');
       to_tsquery
    ----------------
     'file-name.py'


The following part of the current implementation of ftq() replaces
a number of characters with a '-' or a ' ':

        punctuation = r"[^\w\s\-\&\|\!\(\)']"
        query = re.sub(r"(?u)(\w)%s+(\w)" % (punctuation,), r"\1-\2", query)
        query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)

This means that the symbols "#$%*+,./:;<=>?@[\]^`{}~ (and a larger set of
other Unicode symbols: anything that is not a word character, whitespace,
or one of '-&|!()') are replaced with a '-' if they "connect" two
words, and with a ' ' otherwise. A comparison with the FTI data:

    select to_tsvector('foo"bar');
       to_tsvector
    -----------------
     'bar':2 'foo':1

    select to_tsquery('foo"bar');
      to_tsquery
    ---------------
     'foo' & 'bar'

    select ftq('foo"bar');
                ftq
    ---------------------------
     'foo-bar' & 'foo' & 'bar'

(The last output reflects the changes to ftq() described under (1).)

So passing the unchanged search string to to_tsquery() creates a query
object that matches the FTI data, while the search string mangled by
ftq() will not find the original string: 'foo-bar' is not part of the
FTI data.
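The effect of those two substitutions can be reproduced standalone (a
minimal sketch using the regular expressions quoted above); the old
hyphen expansion then rewrote the result even further:

```python
import re

# The old punctuation handling from ftq(), reproduced outside the
# stored procedure.
punctuation = r"[^\w\s\-\&\|\!\(\)']"
query = "file-name.py"
# Punctuation "connecting" two word characters becomes a '-' ...
query = re.sub(r"(?u)(\w)%s+(\w)" % (punctuation,), r"\1-\2", query)
# ... and any other run of punctuation becomes a space.
query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)
print(query)  # file-name-py
```

Note that the result no longer matches the FTI data 'file-name.py':1
stored by to_tsvector() for the original filename.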

I think the main lesson from the example above is that we should let
to_tsquery() handle punctuation as much as possible, because it treats
punctuation almost identically to to_tsvector(). Two exceptions remain:
'\' and ':'. The new tests beginning in line 263 of
lib/lp/services/database/doc/textsearching.txt show that punctuation
is now handled correctly.
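The remaining special handling for these two exception characters is
small; as a standalone sketch of the substitution now in _ftq(), they
are simply replaced with spaces before the string reaches to_tsquery():

```python
import re

# The only punctuation the new _ftq() still strips itself:
# ':' (used in queries to specify a word weight) and '\'
# (treated differently by to_tsvector() and to_tsquery()).
punctuation = r"[:\\]"
stripped = re.sub(r"(?u)%s+" % (punctuation,), " ", "foo:bar")
print(stripped)  # foo bar
```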

Bug 33920 had a special section in the doc test ("It was noticed though
in Bug #33920..."). As the new test part "Punctuation is handled
consistently..." shows, this is no longer an issue, so I removed the tests
about bug 33920.

Same for bug 39828 ("It was also noticed through Bug #39828" in the original
doc test).

I found the test "Bug #44913 - Unicode characters in the wrong place"
a bit confusing: what happened to the leading 'a-'? The answer is
simple: 'a' is a stop word. But I think this distracts from the main
purpose of the test, so I replaced 'a-' with 'abc-'. This results in the
usual query produced by ftq() and to_tsquery() for words containing a '-'.


Miscellaneous
-------------

The stored procedures ftq() and _ftq() were nearly identical: _ftq()
just processes a query string and returns the processed query -- the
procedure is used only for tests and debugging; ftq() did the same
processing but returned the result of

    SELECT to_tsquery(processed_query)

I changed the latter procedure to just call

    SELECT to_tsquery(_ftq(query_string))

This makes the code a bit shorter and avoids the risk that the
implementations of ftq() and _ftq() diverge.

The doc test defines a function ftq() which returns the results of calls
to the stored procedures ftq() and _ftq(). This function helps to understand
how these procedures process a query string, but it does not help to
check whether a given query will match the FTI data of a given text.

Some of the "bad queries" mentioned in bug 29713 can only be understood
by looking at both the Postgres query object for a given search term and
at the FTI data stored for the same term. So I added two test helpers
search(full_text, query_text) and search_same(text) which show the FTI
data for full_text, the Postgres query object for query_text and the result
of searching query_text in full_text.

I changed some of the existing tests to use search() instead of ftq()
because I think that the former function shows a bit better that the
search term can be properly used. For example, the old test for the query
u'a-a\N{RIGHT DOUBLE QUOTATION MARK}' showed that the quotation mark
was removed from the query. This seems to be useful -- but the new test
using search_same() shows that the FTI data for this string contains the
quotation mark, hence the FTI query object should keep it too.

(The fact that to_tsvector() treats the quotation mark as part of a
word is of course a bug, but that is outside the scope of this branch.
My goal for now is to clean up ftq(). A proper fix would be to tweak
the parser used by Postgres in to_tsquery() and in to_tsvector().)

LOC counting
------------

The diff between the recent version of the branch and revision 15402
(the version I branched from trunk) shows 302 added lines and 116 removed
lines.

But I'd argue that this increase is temporary. My first commit
(r15403) just adds a DB patch file containing a plain copy of the
existing procedures ftq() and _ftq(). Running

bzr diff -r 15403 | grep ^+ | wc
bzr diff -r 15403 | grep ^- | wc

shows 194 added lines and 254 removed lines. I'd claim that this count
is more appropriate than the count of added/removed lines against trunk,
because the DB patch file will eventually disappear when a new
"main schema file" (database/schema/launchpad-NNNN-00-0.sql) is
generated. And the diff against r15403 for the file
database/schema/patch-2209-24-1.sql shows changes that are similar to
the changes we can expect between the current
database/schema/launchpad-2209-00-0.sql and a future
database/schema/launchpad-NNNN-00-0.sql.

test: ./bin/test services -vvt textsearching.txt

= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  database/schema/patch-2209-24-1.sql
  lib/lp/services/database/doc/textsearching.txt

./lib/lp/services/database/doc/textsearching.txt
     697: want exceeds 78 characters.

That's not caused by my changes.

-- 
https://code.launchpad.net/~adeuring/launchpad/bug-29713/+merge/111212
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~adeuring/launchpad/bug-29713 into lp:launchpad.
=== added file 'database/schema/patch-2209-24-1.sql'
--- database/schema/patch-2209-24-1.sql	1970-01-01 00:00:00 +0000
+++ database/schema/patch-2209-24-1.sql	2012-06-20 12:43:45 +0000
@@ -0,0 +1,125 @@
+-- Copyright 2012 Canonical Ltd.  This software is licensed under the
+-- GNU Affero General Public License version 3 (see the file LICENSE).
+
+SET client_min_messages=ERROR;
+
+CREATE OR REPLACE FUNCTION _ftq(text) RETURNS text
+    LANGUAGE plpythonu IMMUTABLE STRICT
+    AS $_$
+        import re
+
+        # I think this method would be more robust if we used a real
+        # tokenizer and parser to generate the query string, but we need
+        # something suitable for use as a stored procedure which currently
+        # means no external dependencies.
+
+        # Convert to Unicode
+        query = args[0].decode('utf8')
+        ## plpy.debug('1 query is %s' % repr(query))
+
+        # Normalize whitespace
+        query = re.sub("(?u)\s+"," ", query)
+
+        # Convert AND, OR, NOT and - to tsearch2 punctuation
+        query = re.sub(r"(?u)(?:^|\s)-([\w\(])", r" !\1", query)
+        query = re.sub(r"(?u)\bAND\b", "&", query)
+        query = re.sub(r"(?u)\bOR\b", "|", query)
+        query = re.sub(r"(?u)\bNOT\b", " !", query)
+        ## plpy.debug('2 query is %s' % repr(query))
+
+        # Deal with unwanted punctuation.
+        # ':' is used in queries to specify a weight of a word.
+        # '\' is treated differently in to_tsvector() and to_tsquery().
+        punctuation = r'[:\\]'
+        query = re.sub(r"(?u)%s+" % (punctuation,), " ", query)
+        ## plpy.debug('3 query is %s' % repr(query))
+
+        # Strip ! characters inside and at the end of a word
+        query = re.sub(r"(?u)(?<=\w)[\!]+", " ", query)
+
+        # Now that we have handled the case-sensitive booleans, convert to lowercase
+        query = query.lower()
+
+        # Remove unpartnered bracket on the left and right
+        query = re.sub(r"(?ux) ^ ( [^(]* ) \)", r"(\1)", query)
+        query = re.sub(r"(?ux) \( ( [^)]* ) $", r"(\1)", query)
+
+        # Remove spurious brackets
+        query = re.sub(r"(?u)\(([^\&\|]*?)\)", r" \1 ", query)
+        ## plpy.debug('5 query is %s' % repr(query))
+
+        # Insert & between tokens without an existing boolean operator
+        # ( not preceded by (|&!
+        query = re.sub(r"(?u)(?<![\(\|\&\!])\s*\(", "&(", query)
+        ## plpy.debug('6 query is %s' % repr(query))
+        # ) not followed by )|&
+        query = re.sub(r"(?u)\)(?!\s*(\)|\||\&|\s*$))", ")&", query)
+        ## plpy.debug('6.1 query is %s' % repr(query))
+        # Whitespace not preceded by (|&!, not followed by &|
+        query = re.sub(r"(?u)(?<![\(\|\&\!\s])\s+(?![\&\|\s])", "&", query)
+        ## plpy.debug('7 query is %s' % repr(query))
+
+        # Detect and repair syntax errors - we are lenient because
+        # this input is generally from users.
+
+        # Fix unbalanced brackets
+        openings = query.count("(")
+        closings = query.count(")")
+        if openings > closings:
+            query = query + " ) "*(openings-closings)
+        elif closings > openings:
+            query = " ( "*(closings-openings) + query
+        ## plpy.debug('8 query is %s' % repr(query))
+
+        # Strip ' characters that do not have letters on both sides
+        query = re.sub(r"(?u)((?<!\w)'|'(?!\w))", "", query)
+
+        # Brackets containing nothing but whitespace and booleans, recursive
+        last = ""
+        while last != query:
+            last = query
+            query = re.sub(r"(?u)\([\s\&\|\!]*\)", "", query)
+        ## plpy.debug('9 query is %s' % repr(query))
+
+        # An & or | following a (
+        query = re.sub(r"(?u)(?<=\()[\&\|\s]+", "", query)
+        ## plpy.debug('10 query is %s' % repr(query))
+
+        # An &, | or ! immediately before a )
+        query = re.sub(r"(?u)[\&\|\!\s]*[\&\|\!]+\s*(?=\))", "", query)
+        ## plpy.debug('11 query is %s' % repr(query))
+
+        # An &, | or ! followed by another boolean.
+        query = re.sub(r"(?ux) \s* ( [\&\|\!] ) [\s\&\|]+", r"\1", query)
+        ## plpy.debug('12 query is %s' % repr(query))
+
+        # Leading & or |
+        query = re.sub(r"(?u)^[\s\&\|]+", "", query)
+        ## plpy.debug('13 query is %s' % repr(query))
+
+        # Trailing &, | or !
+        query = re.sub(r"(?u)[\&\|\!\s]+$", "", query)
+        ## plpy.debug('14 query is %s' % repr(query))
+
+        # If we have nothing but whitespace and tsearch2 operators,
+        # return NULL.
+        if re.search(r"(?u)^[\&\|\!\s\(\)]*$", query) is not None:
+            return None
+
+        # Convert back to UTF-8
+        query = query.encode('utf8')
+        ## plpy.debug('15 query is %s' % repr(query))
+
+        return query or None
+        $_$;
+
+CREATE OR REPLACE FUNCTION ftq(text) RETURNS pg_catalog.tsquery
+    LANGUAGE plpythonu IMMUTABLE STRICT
+    AS $_$
+        p = plpy.prepare(
+            "SELECT to_tsquery('default', _ftq($1)) AS x", ["text"])
+        query = plpy.execute(p, args, 1)[0]["x"]
+        return query or None
+        $_$;
+
+INSERT INTO LaunchpadDatabaseRevision VALUES (2209, 24, 1);

=== modified file 'lib/lp/services/database/doc/textsearching.txt'
--- lib/lp/services/database/doc/textsearching.txt	2011-12-30 06:14:56 +0000
+++ lib/lp/services/database/doc/textsearching.txt	2012-06-20 12:43:45 +0000
@@ -138,7 +138,22 @@
     ...         compiled = compiled.decode('UTF-8')
     ...         compiled = compiled.encode('US-ASCII', 'backslashreplace')
     ...     print '%s <=> %s' % (uncompiled, compiled)
-
+    >>>
+    >>> def search(text_to_search, search_phrase):
+    ...     cur = cursor()
+    ...     cur.execute("SELECT to_tsvector(%s)", (text_to_search, ))
+    ...     ts_vector = cur.fetchall()[0][0]
+    ...     cur.execute("SELECT ftq(%s)", (search_phrase, ))
+    ...     ts_query = cur.fetchall()[0][0]
+    ...     cur.execute(
+    ...         "SELECT to_tsvector(%s) @@ ftq(%s)",
+    ...         (text_to_search, search_phrase))
+    ...     match = cur.fetchall()[0][0]
+    ...     return "FTI data: %s query: %s match: %s" % (
+    ...         ts_vector, ts_query, str(match))
+    >>>
+    >>> def search_same(text):
+    ...     return search(text, text)
 
 Queries are lowercased
 
@@ -225,127 +240,178 @@
     (hi&ho|hoe)&work&go <=> ( 'hi' & 'ho' | 'hoe' ) & 'work' & 'go'
 
 
-Hypenation is handled specially. Note that the & operator has precidence
-over the | operator and that tsearch2 removes the unnecessary branckets.
-
-    >>> ftq('foo-bar')
-    ((foo&bar)|foobar) <=> 'foo' & 'bar' | 'foobar'
-
-    >>> ftq('foo-bar-baz')
-    ((foo&bar&baz)|foobarbaz) <=> 'foo' & 'bar' & 'baz' | 'foobarbaz'
-
-    >>> ftq('foo & bar-baz')
-    foo&((bar&baz)|barbaz) <=> 'foo' & ( 'bar' & 'baz' | 'barbaz' )
+If a single '-' precedes a word, it is converted into the '!' operator.
+Note also that a trailing '-' is dropped by to_tsquery().
 
     >>> ftq('-foo bar-')
-    !foo&bar <=> !'foo' & 'bar'
+    !foo&bar- <=> !'foo' & 'bar'
+
+Repeated '-' are simply ignored by to_tsquery().
 
     >>> ftq('---foo--- ---bar---')
-    foo&bar <=> 'foo' & 'bar'
-
-    >>> ftq('foo-bar test')
-    ((foo&bar)|foobar)&test <=> ( 'foo' & 'bar' | 'foobar' ) & 'test'
-
-    >>> ftq('foo-bar OR test')
-    ((foo&bar)|foobar)|test <=> ( 'foo' & 'bar' | 'foobar' ) | 'test'
-
-
-Most punctuation characters are converted to whitespace outside of
-words, or treated as a hypen inside words. The exceptions are the
-operators ()!&|!.
-
-    >>> ftq(':100%')
-    100 <=> '100'
-
-    >>> ftq(r'foo\bar')
-    ((foo&bar)|foobar) <=> 'foo' & 'bar' | 'foobar'
-
-    >>> ftq('/dev/pmu')
-    ((dev&pmu)|devpmu) <=> 'dev' & 'pmu' | 'devpmu'
+    ---foo---&---bar--- <=> 'foo' & 'bar'
+
+Hyphens surrounded by two words are retained. This reflects how
+to_tsquery() and to_tsvector() handle such strings.
+
+    >>> print search_same('foo-bar')
+    FTI data: 'bar':3 'foo':2 'foo-bar':1
+    query: 'foo-bar' & 'foo' & 'bar'
+    match: True
+
+
+Punctuation is handled consistently. If a string containing punctuation
+appears in an FTI, it can also be passed to ftq(), and a search for this
+string finds the indexed text.
+
+    >>> punctuation = '\'"#$%*+,./:;<=>?@[\]^`{}~'
+    >>> for symbol in punctuation:
+    ...     print repr(symbol), search_same('foo%sbar' % symbol)
+    "'" FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '"' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '#' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '$' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '%' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '*' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '+' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    ',' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '.' FTI data: 'foo.bar':1 query: 'foo.bar' match: True
+    '/' FTI data: 'foo/bar':1 query: 'foo/bar' match: True
+    ':' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    ';' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '<' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '=' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '>' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '?' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '@' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '[' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '\\' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    ']' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '^' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '`' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '{' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '}' FTI data: 'bar':2 'foo':1 query: 'foo' & 'bar' match: True
+    '~' FTI data: 'foo':1 '~bar':2 query: 'foo' & '~bar' match: True
+
+    >>> for symbol in punctuation:
+    ...     print repr(symbol), search_same('aa %sbb%s cc' % (symbol, symbol))
+    "'" FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '"' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '#' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '$' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '%' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '*' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '+' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    ',' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '.' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '/' FTI data: '/bb':2 'aa':1 'cc':3 query: 'aa' & '/bb' & 'cc' match: True
+    ':' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    ';' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '<' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '=' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '>' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '?' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '@' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '[' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '\\' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    ']' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '^' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '`' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '{' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '}' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & 'bb' & 'cc' match: True
+    '~' FTI data: 'aa':1 'bb':2 'cc':3 query: 'aa' & '~bb' & 'cc' match: False
+
+XXX Abel Deuring 2012-06-20 bug=1015511: Note that the last line above
+shows a bug: The FTI data for the string "aa ~bb~ cc" contains the words
+'aa', 'bb', 'cc', while the ts_query object for the same text contains
+'aa', '~bb', 'cc', hence the query does not match the string. More details_
+
+XXX Abel Deuring 2012-06-20 bug=1015519: XML tags cannot be searched.
+
+    >>> print search_same('foo <bar> baz')
+    FTI data: 'baz':2 'foo':1 query: 'foo' & 'baz' match: True
+
+More specifically, tags are simply dropped from the FTI data and from
+search queries.
+
+    >>> print search('some text <div>whatever</div>', '<div>')
+    FTI data: 'text':2 'whatev':3 query: None match: None
+
+Of course, omitting '<' and '>' from the query does not help.
+
+    >>> print search('some text <div>whatever</div>', 'div')
+    FTI data: 'text':2 'whatev':3 query: 'div' match: False
+
+Treatment of characters that are used as operators in to_tsquery():
 
     >>> ftq('cool!')
     cool <=> 'cool'
 
-    >>> ftq('foo@xxxxxxx')
-    ((foo&bar&com)|foobarcom) <=> 'foo' & 'bar' & 'com' | 'foobarcom'
-
+Email addresses are retained as a whole, both by to_tsvector() and by
+ftq().
+
+    >>> print search_same('foo@xxxxxxx')
+    FTI data: 'foo@xxxxxxx':1 query: 'foo@xxxxxxx' match: True
+
+File names are retained as a whole.
+
+    >>> print search_same('foo-bar.txt')
+    FTI data: 'foo-bar.txt':1 query: 'foo-bar.txt' match: True
 
 Some punctuation we pass through to tsearch2 for it to handle.
-
-    >>> ftq("shouldn't") # NB. This gets stemmed, see below
-    shouldn't <=> 'shouldn'
-
-It was noticed though in Bug #33920 that tsearch2 couldn't cope if the
-apostrophe was not inside a word. So we strip it in these cases.
-
-    >>> ftq("'cool")
-    cool <=> 'cool'
-    >>> ftq("'shouldn't")
-    shouldn't <=> 'shouldn'
-    >>> ftq("' cool")
-    cool <=> 'cool'
-    >>> ftq("cool '")
-    cool <=> 'cool'
-    >>> ftq("' cool '")
-    cool <=> 'cool'
-    >>> ftq("'cool'")
-    cool <=> 'cool'
-    >>> ftq("('cool' AND bananas)")
-    (cool&bananas) <=> 'cool' & 'banana'
-
-It was also noticed through Bug #39828 that tsearch2 will not cope if the
-! character is embedded inside or found at the end of a word.
-
-    >>> ftq('cool!')
-    cool <=> 'cool'
-    >>> ftq('hi!mom')
-    hi&mom <=> 'hi' & 'mom'
-    >>> ftq('hi!!!!!mom')
-    hi&mom <=> 'hi' & 'mom'
-    >>> ftq('hi !mom')
-    hi&!mom <=> 'hi' & !'mom'
-
-
-Bug #44913 - Unicode characters in the wrong place
-
-    >>> ftq(u'a-a\N{LATIN SMALL LETTER C WITH CEDILLA}')
-    ((a&a\xe7)|aa\xe7) <=> 'a\xe7' | 'aa\xe7'
-
-    Cut & Paste of 'Smart' quotes
-
-    >>> ftq(u'a-a\N{RIGHT DOUBLE QUOTATION MARK}')
-    ((a&a)|aa) <=> 'aa'
-
-    >>> ftq(u'\N{LEFT SINGLE QUOTATION MARK}a.a\N{RIGHT SINGLE QUOTATION MARK}')
-    ((a&a)|aa) <=> 'aa'
+NB. This gets stemmed, see below.
+
+    >>> print search_same("shouldn't")
+    FTI data: 'shouldn':1 query: 'shouldn' match: True
+
+Bug #44913 - Unicode characters in the wrong place.
+
+    >>> search_same(u'abc-a\N{LATIN SMALL LETTER C WITH CEDILLA}')
+    "FTI data: 'abc':2 'abc-a\xc3\xa7':1 'a\xc3\xa7':3
+    query: 'abc-a\xc3\xa7' & 'abc' & 'a\xc3\xa7'
+    match: True"
+
+Cut & Paste of 'Smart' quotes. Note that the quotation mark is retained
+in the FTI.
+
+    >>> print search_same(u'a-a\N{RIGHT DOUBLE QUOTATION MARK}')
+    FTI data: 'a-a”':1 'a”':3 query: 'a-a”' & 'a”' match: True
+
+    >>> print search_same(
+    ...     u'\N{LEFT SINGLE QUOTATION MARK}a.a'
+    ...     u'\N{RIGHT SINGLE QUOTATION MARK}')
+    FTI data: 'a’':2 '‘a':1 query: '‘a' & 'a’' match: True
 
 
 Bug #44913 - Nothing but stopwords in a query needing repair
 
-    >>> ftq('a)a')
-    a&a <=> None
+    >>> print search_same('a)a')
+    FTI data:  query: None match: None
 
 
 Stop words (words deemed too common in English to search on) are removed
 from queries by tsearch2.
 
-    >>> ftq("Don't do it harder!")
-    don't&do&it&harder <=> 'harder'
+    >>> print search_same("Don't do it harder!")
+    FTI data: 'harder':5 query: 'harder' match: True
 
 
 Note that some queries will return None after compilation, because they
 contained nothing but stop words or punctuation.
 
-    >>> ftq("don't do it!")
-    don't&do&it <=> None
+    >>> print search_same("don't do it!")
+    FTI data:  query: None match: None
 
-    >>> ftq(",,,")
-    None <=> None
+    >>> print search_same(",,,")
+    FTI data:  query: None match: None
 
 
 Queries containing nothing except whitespace, boolean operators and
 punctuation will just return None.
 
+Note in the fourth example below that the '-' left in the query by _ftq()
+is ignored by to_tsquery().
+
     >>> ftq(" ")
     None <=> None
     >>> ftq("AND")
@@ -353,7 +419,7 @@
     >>> ftq(" AND (!)")
     None <=> None
     >>> ftq("-")
-    None <=> None
+    - <=> None
 
 
 Words are also stemmed by tsearch2 (using the English stemmer).
@@ -381,7 +447,7 @@
     (hi|!hello)&mom <=> ( 'hi' | !'hello' ) & 'mom'
 
     >>> ftq('(hi OR - AND hello) AND mom')
-    (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
+    (hi|-&hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
     >>> ftq('hi AND mom AND')
     hi&mom <=> 'hi' & 'mom'
@@ -393,7 +459,7 @@
     (hi|hello)&mom <=> ( 'hi' | 'hello' ) & 'mom'
 
     >>> ftq('() hi mom ( ) ((! |((&)))) :-)')
-    (hi&mom) <=> 'hi' & 'mom'
+    (hi&mom&-) <=> 'hi' & 'mom'
 
     >>> ftq("(hi mom")
     hi&mom <=> 'hi' & 'mom'
@@ -414,15 +480,15 @@
     hi&mom <=> 'hi' & 'mom'
 
     >>> ftq("(foo .") # Bug 43245
-    foo <=> 'foo'
+    foo&. <=> 'foo'
 
     >>> ftq("(foo.")
-    foo <=> 'foo'
+    foo. <=> 'foo'
 
     Bug #54972
 
     >>> ftq("a[a\n[a")
-    ((a&a)|aa)&a <=> 'aa'
+    a[a&[a <=> None
 
     Bug #96698
 
@@ -437,10 +503,10 @@
     Bug #160236
 
     >>> ftq("foo&&bar-baz")
-    foo&((bar&baz)|barbaz) <=> 'foo' & ( 'bar' & 'baz' | 'barbaz' )
+    foo&bar-baz <=> 'foo' & 'bar-baz' & 'bar' & 'baz'
 
     >>> ftq("foo||bar.baz")
-    foo|((bar&baz)|barbaz) <=> 'foo' | ( 'bar' & 'baz' | 'barbaz' )
+    foo|bar.baz <=> 'foo' | 'bar.baz'
 
 
 Phrase Searching
@@ -482,7 +548,8 @@
 
     >>> runsql(r"""
     ...   SELECT title, max(ranking) FROM (
-    ...    SELECT Bug.title,rank(Bug.fti||Message.fti,ftq('firefox')) AS ranking
+    ...    SELECT Bug.title,rank(Bug.fti||Message.fti,ftq('firefox'))
+    ...    AS ranking
     ...    FROM Bug, BugMessage, Message
     ...    WHERE Bug.id = BugMessage.bug AND Message.id = BugMessage.message
     ...       AND (Bug.fti @@ ftq('firefox') OR Message.fti @@ ftq('firefox'))
@@ -499,7 +566,8 @@
     ...       AND BugTask.product = Product.id
     ...       AND Product.name LIKE lower('%firefox%')
     ...    UNION
-    ...    SELECT Bug.title, rank(Product.fti, ftq('firefox')) - 0.3 AS ranking
+    ...    SELECT Bug.title, rank(Product.fti, ftq('firefox')) - 0.3
+    ...    AS ranking
     ...    FROM Bug, BugTask, Product
     ...    WHERE Bug.id = BugTask.bug
     ...       AND BugTask.product = Product.id
@@ -518,7 +586,8 @@
     Printing doesn't work     0.70
 
 
-== Natural Language Phrase Query ==
+Natural Language Phrase Query
+-----------------------------
 
 The standard boolean searches of tsearch2 are fine, but sometime you
 want more fuzzy searches.
@@ -557,7 +626,8 @@
 on Ubuntu) - so we are disabling this and reworking from the ground up.
 
 
-=== nl_term_candidates() ===
+nl_term_candidates()
+~~~~~~~~~~~~~~~~~~~~
 
 To find the terms in a search phrase that are canditates for the search,
 we can use the nl_term_candidates() function. This function uses ftq()
@@ -574,19 +644,16 @@
     >>> nl_term_candidates('how do I do this?')
     []
 
-We also handle expansion of hypenated words (like ftq does):
-
-    >>> nl_term_candidates('firefox foo-bar give me trouble')
-    [u'firefox', u'foo', u'bar', u'foobar', u'give', u'troubl']
-
 Except for the hyphenation character, all non-word caracters are ignored:
 
     >>> nl_term_candidates(
-    ...     "Will the \'\'|\'\' character (inside a ''quoted'' string) work???")
+    ...     "Will the \'\'|\'\' character (inside a ''quoted'' string) "
+    ...     "work???")
     [u'charact', u'insid', u'quot', u'string', u'work']
 
 
-=== nl_phrase_search() ===
+nl_phrase_search()
+~~~~~~~~~~~~~~~~~~
 
 To get the actual tsearch2 query that should be run, you will use the
 nl_phrase_search() function. This one takes two mandatory parameters and
@@ -637,7 +704,8 @@
     u'slow|system'
 
 
-==== Using other constraints ====
+Using other constraints
+.......................
 
 You can pass a third parameter to the function that will be use as
 an additional constraint to determine the total number of rows that
@@ -659,7 +727,8 @@
 
     >>> nl_phrase_search(
     ...     'firefox gets very slow on flickr', Question,
-    ...     "Question.product = %s AND Product.active = 't'" % firefox_product.id,
+    ...     "Question.product = %s AND Product.active = 't'"
+    ...     % firefox_product.id,
     ...     ['Product'], fast_enabled=False)
     u'slow|flickr'
 
@@ -679,7 +748,8 @@
     u'(firefox&flickr&slow)|(flickr&slow)|(firefox&slow)|(firefox&flickr)'
 
 
-==== No keywords filtering with few rows ====
+No keywords filtering with few rows
+...................................
 
 The 50% rule is really useful only when there are many rows. When there
 only very few rows, that keyword elimination becomes a problem since

