enterprise-support team mailing list archive

[Bug 1679135] [NEW] Ignore INNODB_FT_DEFAULT_STOPWORD for ngram indexes

 

Public bug reported:

Originally reported at https://bugs.mysql.com/bug.php?id=84420


[5 Jan 11:19] Miguel Angel Nieto

Description:
Ngram indexes also check the stopword list, to see whether any indexed token *contains* one of the words on that list. This is reasonable as general behaviour, but I don't think the default stopword table is suitable for use with ngram.

For example, any token that contains 'a' or 'i' is ignored. So if you
have the word "east", you cannot search for "ea", because that token
was dropped from the index.

Ngram should have a different default list of stopwords, or an empty
list.

How to repeat:
mysql> CREATE TABLE `articles` ( 
`id` int(10) unsigned NOT NULL AUTO_INCREMENT, 
`body` text, 
PRIMARY KEY (`id`), 
FULLTEXT KEY `ftx` (`body`) /*!50100 WITH PARSER `ngram` */ 
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

mysql> insert into articles (body) values ('east'); 
mysql> insert into articles (body) values ('east area'); 
mysql> insert into articles (body) values ('east job'); 
mysql> insert into articles (body) values ('eastnation'); 
mysql> insert into articles (body) values ('eastway, try try');

mysql> SELECT * FROM articles WHERE MATCH(body) AGAINST('ea' IN BOOLEAN MODE); 
Empty set (0.00 sec)

====

There is a workaround for this bug: create a custom
INNODB_FT_DEFAULT_STOPWORD table for ngram indexes. The issue with this
workaround is that such a table is also used by other fulltext indexes,
such as mecab.

Suggested fix: either have a special INNODB_FT_DEFAULT_STOPWORD table for
ngram indexes, or ignore it altogether.

There is also code in fts_check_token:

bool
fts_check_token(
    const fts_string_t*     token,
    const ib_rbt_t*         stopwords,
    bool                    is_ngram,
    const CHARSET_INFO*     cs)
{
    ut_ad(cs != NULL || stopwords == NULL);

    if (!is_ngram) {
        ib_rbt_bound_t  parent;

        if (token->f_n_char < fts_min_token_size
            || token->f_n_char > fts_max_token_size
            || (stopwords != NULL
            && rbt_search(stopwords, &parent, token) == 0)) {
            return(false);
        } else {
            return(true);
        }
    }

    /* Check token for ngram. */
    DBUG_EXECUTE_IF(
        "fts_instrument_ignore_ngram_check",
        return(true);
    );

So the only work needed is to replace the DBUG_EXECUTE_IF with some new option.

** Affects: mysql-server
     Importance: Unknown
         Status: Unknown

** Affects: percona-server
     Importance: Undecided
         Status: Confirmed

** Affects: percona-server/5.5
     Importance: Undecided
         Status: Invalid

** Affects: percona-server/5.6
     Importance: Undecided
         Status: Invalid

** Affects: percona-server/5.7
     Importance: Undecided
         Status: Confirmed


** Tags: i180635

** Also affects: percona-server/5.5
   Importance: Undecided
       Status: New

** Also affects: percona-server/5.7
   Importance: Undecided
       Status: Confirmed

** Also affects: percona-server/5.6
   Importance: Undecided
       Status: New

** Changed in: percona-server/5.6
       Status: New => Invalid

** Changed in: percona-server/5.5
       Status: New => Invalid

** Bug watch added: MySQL Bug System #84420
   http://bugs.mysql.com/bug.php?id=84420

** Also affects: mysql-server via
   http://bugs.mysql.com/bug.php?id=84420
   Importance: Unknown
       Status: Unknown

https://bugs.launchpad.net/bugs/1679135