ubuntu-translations-coordinators team mailing list archive
Message #01710
[Bug 744914] Re: transliterate text/use collation before adding to xapian db and when searching
** Changed in: software-center (Ubuntu Precise)
Status: Confirmed => Triaged
--
You received this bug notification because you are a member of Ubuntu
Translations Coordinators, which is subscribed to Ubuntu Translations.
https://bugs.launchpad.net/bugs/744914
Title:
transliterate text/use collation before adding to xapian db and when
searching
Status in Ubuntu Translations:
New
Status in “software-center” package in Ubuntu:
Triaged
Status in “software-center” source package in Precise:
Triaged
Bug description:
Binary package hint: software-center
As of now software center uses str.lower() when searching in the
xapian db:
utils/query.py
22: s = search_term.lower()
33: query = xapian.Query(str_to_prefix[search_prefix]+search_term.lower())
There are two problems with this:
* many languages use diacritical marks, but users typing quickly often enter only the base characters (in Romanian, ăâșțî and ĂÂȘȚÎ are spelled AASTI by some users).
* characters in the Unicode set can appear in two forms, composed and
decomposed: the character U+00C7 (LATIN CAPITAL LETTER C WITH
CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL
LETTER C) U+0327 (COMBINING CEDILLA), with the base letter first and
the combining mark after it.
To solve both problems, both the text entered in the xapian db and the
user's query text must be normalized in the same way.
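As a rough illustration of the composed/decomposed problem, the sketch
below uses only Python's standard unicodedata module. The helper name
normalize_term is hypothetical and is not part of software-center; it
only shows that lower() alone does not make the two forms compare
equal, while normalizing to one form (NFC here) first does.

```python
import unicodedata

composed = "\u00c7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA


def normalize_term(text):
    """Hypothetical helper: bring text to NFC, then lowercase it."""
    return unicodedata.normalize("NFC", text).lower()


# lower() on its own leaves the two spellings different.
print(composed.lower() == decomposed.lower())

# After normalization both spellings yield the same string.
print(normalize_term(composed) == normalize_term(decomposed))
```

The same helper would have to be applied both when indexing and when
building the query, otherwise the two sides could still disagree.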
The search function in Chromium uses ICU rules to achieve this:
- http://code.google.com/p/chromium/issues/detail?id=1100
- http://www.google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/editing/TextIterator.cpp&q=file:TextIterator.cpp&l=1882
There is a python-icu library that could help achieve this. See, for
example:
http://lists.osafoundation.org/pipermail/pyicu-dev/2010-October/000214.html
Or one could just remove the diacritical marks from the string
altogether:
http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
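A minimal sketch of the "remove the diacritical marks" approach, using
only the standard unicodedata module. The function name
strip_diacritics is an assumption made for illustration, not existing
software-center code. It decomposes the text (NFD), drops the
combining marks, and lowercases the result. Letters that have no
canonical decomposition (for example ø) pass through unchanged, so
this is not a full transliteration.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: drop combining marks and lowercase the text."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    base_only = "".join(ch for ch in decomposed
                        if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base_only).lower()


# The Romanian letters from the bug description, in both cases.
print(strip_diacritics("ăâșțî"))
print(strip_diacritics("ĂÂȘȚÎ"))
```

Applied to both the indexed text and the search term, this would let a
query typed with base characters match text stored with diacritics.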
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-translations/+bug/744914/+subscriptions