maria-developers team mailing list archive

Re: e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin


Hi, Rinat!

On Nov 02, Rinat Ibragimov wrote:
> > > I can't decide. From my point of view, the current approach is
> > > fine. Please pick a variant, and I'll try to implement that.
> >
> > No, I cannot guess which approach will produce more relevant
> > searches. Implement something and then we'll test what works better.
> The variable-length n-grams approach is too innovative and hard to
> reason about. I've never heard of such an approach, and it doesn't look
> good to me. So I'll stick with a simple slicer.

If you mean that variant where it splits

  "n-grams approach" to "n-gr", "gra", "ram", "ams", "ms a", "s ap", "app", ... 

then it's just "n letters in every chunk", which is very easy to explain.
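
To make the counting rule concrete, here is a minimal C sketch of that
"n letters in every chunk" slicing. It is only an illustration, not the
plugin's code: letter_ngrams, is_letter and MAX_CHUNK are made-up names,
and it assumes plain ASCII, where a real parser would use the column's
charset.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHUNK 32  /* hypothetical cap on chunk size in bytes, incl. NUL */

/* Hypothetical helper: ASCII letters only (an assumption of this sketch). */
static int is_letter(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/*
 * Emit every chunk that starts on a letter and extends until it holds
 * exactly n letters. Non-letters inside a chunk are kept but not counted.
 * Returns the number of chunks written to out (at most max_out).
 */
static size_t letter_ngrams(const char *text, size_t n,
                            char out[][MAX_CHUNK], size_t max_out)
{
    size_t count = 0;
    size_t len = strlen(text);

    if (n == 0)
        return 0;

    for (size_t start = 0; start < len && count < max_out; start++) {
        if (!is_letter(text[start]))
            continue;                 /* chunks only start on a letter */

        size_t letters = 0;
        size_t end = start;
        while (end < len && letters < n) {
            if (is_letter(text[end]))
                letters++;
            end++;
        }
        if (letters < n)
            break;                    /* fewer than n letters remain */
        if (end - start >= MAX_CHUNK)
            continue;                 /* does not fit the fixed buffer */

        memcpy(out[count], text + start, end - start);
        out[count][end - start] = '\0';
        count++;
    }
    return count;
}
```

For the text "n-grams approach" and n = 3 this yields "n-gr", "gra",
"ram", "ams", "ms a", "s ap", "app" and so on, twelve chunks in total.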

But ok, let's start simple and benchmark.

> > Of course, it can. Note that fts_get_word() doesn't generate n-grams
> > either, it gets the whole word and the n-gram plugin later splits it
> > into n-grams. Similarly param->mysql_parse() will extract words for
> > you and you'll split them into n-grams.
> Changed to use param->mysql_parse().
> Turns out that in Aria, MyISAM, and InnoDB, param->mysql_parse() does
> call back param->mysql_add_word(). Is it part of the plugin API?

Yes, it is. E.g. slide 12 from my old presentation shows that there are
three points where a plugin can add functionality.

* It can extract the text and then call param->mysql_parse(),
  this allows parsing, say, gzip-ed texts or EXIF comments in images.

* It can replace param->mysql_parse(), to use different rules for
  splitting the text into words. This is what the n-gram plugin normally
  does.

* It can replace param->mysql_add_word() to post-process every word
  after the built-in parser did the splitting. For example, stemming or
  soundex plugin can do that.
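
As a rough illustration of the third point (and of the approach discussed
above, where the built-in parser extracts the words and the plugin splits
them into n-grams), here is a self-contained sketch. The struct ft_param
below is a simplified stand-in, not the real MYSQL_FTPARSER_PARAM from
include/mysql/plugin_ftparser.h (the real callbacks take more arguments),
and server_parse, server_add_word, plugin_parse, ngram_add_word and NGRAM
are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NGRAM 2  /* hypothetical fixed n-gram size */

/* Simplified stand-in for the parser parameter struct (an assumption). */
struct ft_param {
    int (*mysql_parse)(struct ft_param *, const char *doc, int len);
    int (*mysql_add_word)(struct ft_param *, const char *word, int len);
    void *state;
};

typedef int (*add_word_fn)(struct ft_param *, const char *, int);
struct plugin_state { add_word_fn orig_add_word; };

static char indexed[256];  /* demo sink: what the "server" received */

/* Stand-in for the server's add_word: records tokens separated by '|'. */
static int server_add_word(struct ft_param *p, const char *w, int len)
{
    size_t used = strlen(indexed);
    (void)p;
    if (used + (size_t)len + 2 > sizeof indexed)
        return 1;
    memcpy(indexed + used, w, (size_t)len);
    indexed[used + len] = '|';
    indexed[used + len + 1] = '\0';
    return 0;
}

/* Stand-in for the built-in parser: splits on non-alphanumerics and
   hands every word to whatever mysql_add_word currently points to. */
static int server_parse(struct ft_param *p, const char *doc, int len)
{
    int i = 0;
    while (i < len) {
        while (i < len && !isalnum((unsigned char)doc[i]))
            i++;
        int start = i;
        while (i < len && isalnum((unsigned char)doc[i]))
            i++;
        if (i > start && p->mysql_add_word(p, doc + start, i - start))
            return 1;
    }
    return 0;
}

/* Plugin's replacement: post-process each word into n-grams and forward
   them to the original add_word. Words shorter than NGRAM are dropped. */
static int ngram_add_word(struct ft_param *p, const char *w, int len)
{
    struct plugin_state *st = p->state;
    for (int i = 0; i + NGRAM <= len; i++)
        if (st->orig_add_word(p, w + i, NGRAM))
            return 1;
    return 0;
}

/* Plugin entry point: swap in the hook, let the built-in parser do the
   word splitting, then restore the original callback and state. */
static int plugin_parse(struct ft_param *p, const char *doc, int len)
{
    struct plugin_state st = { p->mysql_add_word };
    void *saved_state = p->state;
    int rc;

    p->state = &st;
    p->mysql_add_word = ngram_add_word;
    rc = p->mysql_parse(p, doc, len);
    p->mysql_add_word = st.orig_add_word;
    p->state = saved_state;
    return rc;
}
```

In this toy setup, parsing "ab cde" makes the built-in parser find the
words "ab" and "cde", and the sink ends up with "ab", "cd" and "de".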

> Comments in include/mysql/plugin_ftparser.h do not mention that at
> all. That's why I initially thought that param->mysql_parse() will
> parse the string like the default parser does, without any way to
> interact with the process.

I've edited the comment to mention this possibility.
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx