maria-developers team mailing list archive

Thread
Date
Re: e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin

To: Rinat Ibragimov <ri@xxxxxxxxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Wed, 21 Oct 2020 10:27:08 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CAFNvzj7Om06nJ6s_px4=vTCx63K9=dKOTrEYckViRnzJky2WNw@mail.gmail.com>
Hi, Rinat!

On Oct 21, Rinat Ibragimov wrote:
> > > +
> > > +    // Avoid creating partial n-grams by stopping at word boundaries.
> > > +    // Underscore is treated as a word character too, since identifiers in
> > > +    // programming languages often contain them.
> > > +    if (!is_alnum(char_type) && !is_underscore(cs, next, end)) {
> > > +      start= next + char_len;
> > > +      next= start;
> > > +      n_chars= 0;
> > > +      continue;
> >
> > I don't think it's a good idea. On the opposite, I would rather create
> > ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to
> >
> > "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin".
> 
> I see how this can increase index size: more different strings are
> included. Implementation will become a tiny bit simpler. But can it
> make searching better somehow?

Yes, the idea is that it'll rank

  "One can use n-gram plugin in MariaDB fulltext search"

higher than

  "Plugins are useful for extending functionality of a program"

because some ngrams encode that "n-gram" and "plugin" words must be
together as a phrase.

> I can't decide. From my point of view, the current approach is fine.
> Please pick a variant, and I'll try to implement that.

No, I cannot guess which approach will produce more relevant searches.
Implement something and then we test what works better

> > > +    }
> > > +
> > > +    next += char_len;
> > > +    n_chars++;
> > > +
> > > +    if (n_chars == ngram_token_size) {
> > > +      bool_info->position= static_cast<uint>(start - doc);
> >
> > nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this
> > plugin is definitely not the reason to break the API and add it.
> 
> There is a patch in the patchset that adds the "position" field. Without
> that field,
> it's possible to trigger an assertion in InnoDB's FTS code by executing:
> 
>     INSTALL PLUGIN simple_parser SONAME 'mypluglib';
>     CREATE TABLE articles(
>         a TEXT DEFAULT NULL,
>         b TEXT DEFAULT NULL,
>         FULLTEXT (a, b) WITH PARSER simple_parser
>     ) ENGINE=InnoDB;
>     INSERT INTO articles VALUES ('111', '1234 1234 1234');
>     DROP TABLE articles;
>     UNINSTALL PLUGIN simple_parser;
> 
> Same assertion may be triggered by any fts plugin. Are you sure you
> want that patch to be removed?

I don't think InnoDB assert is a reason to extend the fulltext plugin
API. What about this fix instead:

diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc
--- a/storage/innobase/fts/fts0fts.cc
+++ b/storage/innobase/fts/fts0fts.cc
@@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser(
        ut_ad(boolean_info->position >= 0);
        position = boolean_info->position + fts_param->add_pos;
        */
-       position = fts_param->add_pos;
+       position = fts_param->add_pos++;

        fts_add_token(result_doc, str, position);

this fixed the crash for me.

> > > +      n_chars= ngram_token_size - 1;
> > > +      is_first= false;
> > > +    }
> > > +  }
> > > +
> > > +  // If nothing was emitted yet, emit the remainder.
> >
> > what reminder, what is it for?
> 
> It's "remainder", with "a" before "i". Remainder of the processed
> string. That way strings that are too short for having even a single
> n-gram, will still generate some index entries and will be
> discoverable by exact queries.

Perhaps the comment should say it? That if the string too short, it'll
be emitted here?

> > > +    case MYSQL_FTPARSER_FULL_BOOLEAN_INFO:
> > > +      // Reusing InnoDB's boolean mode parser to handle all special characters.
> > > +      while (fts_get_word(cs, &start, end, &word, &bool_info)) {
> >
> > nope. InnoDB might not even be compiled it. Use param->mysql_parse()
> 
> Copied fts_get_word() implementation into the plugin code.
> 
> As far as I understand, param->mysql_parse() cannot be used here as it
> splits a string into words like default fulltext search does. This
> plugin however must generate n-grams.

Of course, it can. Note that fts_get_word() doesn't generate n-grams
either, it gets the whole word and the n-gram plugin later splits it
into n-grams. Similarly param->mysql_parse() will extract words for you
and you'll split them into n-grams.

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx
References

Re: e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
From: Sergei Golubchik, 2020-10-19