maria-developers team mailing list archive
Re: e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
On Nov 02, Rinat Ibragimov wrote:
> > > I can't decide. From my point of view, the current approach is
> > > fine. Please pick a variant, and I'll try to implement that.
> > No, I cannot guess which approach will produce more relevant
> > searches. Implement something and then we test what works better
> The variable-length n-grams approach is too innovative and hard to
> reason about. I've never heard of such an approach, and it doesn't look
> good to me. So I'll stick with a simple slicer.
If you mean the variant where it splits
"n-grams approach" into "n-gr", "gra", "ram", "ams", "ms a", "s ap", "app", ...
then it's just "n letters in every chunk", which is very easy to explain.
But ok, let's start simple and benchmark.
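For illustration, here is a minimal sketch of the "n letters in every chunk" rule as it can be read from the example above: a chunk starts on a letter and runs until it contains n letters, and non-letter characters inside it are kept but not counted. The names slice_n_letters, chunk_cb, chunk_list and collect_chunk are hypothetical and not taken from the patch, and the sketch assumes a single-byte character set.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives one chunk (not NUL-terminated). */
typedef void (*chunk_cb)(const char *chunk, size_t len, void *arg);

/*
 * Emit every chunk that starts on a letter and contains exactly n letters.
 * Non-letters between the letters stay in the chunk but are not counted.
 * Assumption: single-byte characters, letters decided by isalpha().
 * Returns the number of chunks emitted.
 */
static size_t slice_n_letters(const char *text, size_t len, size_t n,
                              chunk_cb cb, void *arg)
{
  size_t count = 0;
  assert(n > 0);
  for (size_t start = 0; start < len; start++) {
    if (!isalpha((unsigned char) text[start]))
      continue;                         /* chunks only start on a letter */
    size_t letters = 0, end = start;
    while (end < len) {
      if (isalpha((unsigned char) text[end]) && ++letters == n)
        break;                          /* end now points at the n-th letter */
      end++;
    }
    if (letters < n)
      break;                            /* fewer than n letters remain */
    cb(text + start, end - start + 1, arg);
    count++;
  }
  return count;
}

/* Hypothetical helper for experiments: joins the chunks with '|'. */
struct chunk_list { char buf[256]; size_t used; };

static void collect_chunk(const char *chunk, size_t len, void *arg)
{
  struct chunk_list *out = arg;
  if (out->used + len + 2 > sizeof(out->buf))
    return;                             /* silently drop on overflow */
  if (out->used)
    out->buf[out->used++] = '|';
  memcpy(out->buf + out->used, chunk, len);
  out->used += len;
  out->buf[out->used] = '\0';
}
```

Called with the text "n-grams approach" and n = 3, this yields the chunks listed above ("n-gr", "gra", "ram", "ams", "ms a", "s ap", "app", and so on). Whether this is the intended rule is an assumption based on that one example.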
> > Of course, it can. Note that fts_get_word() doesn't generate n-grams
> > either, it gets the whole word and the n-gram plugin later splits it
> > into n-grams. Similarly param->mysql_parse() will extract words for
> > you and you'll split them into n-grams.
> Changed to use param->mysql_parse().
> Turns out that in Aria, MyISAM, and InnoDB, param->mysql_parse() does
> call back param->mysql_add_word(). Is it part of the plugin API?
Yes, it is. E.g. slide 12 from my old presentation:
shows that there are three points where a plugin can add functionality.
* It can extract the text and then call param->mysql_parse();
this allows parsing, say, gzipped texts or EXIF comments in images.
* It can replace param->mysql_parse(), to use different rules for
splitting the text into words. This is what the n-gram plugin normally
does.
* It can replace param->mysql_add_word() to post-process every word
after the built-in parser did the splitting. For example, stemming or
soundex plugin can do that.
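The third point can be sketched with a toy model. This is not the real plugin API: toy_param is a hypothetical stand-in for MYSQL_FTPARSER_PARAM that keeps only the two callbacks, with simplified signatures (the real callbacks in include/mysql/plugin_ftparser.h take more arguments), and server_parse() / server_add_word() are made-up stand-ins for the server side. The sketch assumes single-byte characters and drops words shorter than n.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for MYSQL_FTPARSER_PARAM (hypothetical, heavily simplified). */
typedef struct toy_param {
  int (*mysql_parse)(struct toy_param *, const char *doc, int doc_len);
  int (*mysql_add_word)(struct toy_param *, const char *word, int word_len);
  void *state;       /* hypothetical: per-call plugin state */
  char index[256];   /* hypothetical: words the "server" has indexed */
} toy_param;

/* "Server" side: append a word to the index, separated by spaces. */
static int server_add_word(toy_param *p, const char *word, int word_len)
{
  size_t used = strlen(p->index);
  if (used + (size_t) word_len + 2 > sizeof(p->index))
    return 1;
  if (used)
    p->index[used++] = ' ';
  memcpy(p->index + used, word, (size_t) word_len);
  p->index[used + (size_t) word_len] = '\0';
  return 0;
}

/* "Server" side: built-in parser. It extracts alphanumeric words and
   calls back through param->mysql_add_word for each of them. */
static int server_parse(toy_param *p, const char *doc, int doc_len)
{
  int i = 0;
  while (i < doc_len) {
    while (i < doc_len && !isalnum((unsigned char) doc[i]))
      i++;
    int start = i;
    while (i < doc_len && isalnum((unsigned char) doc[i]))
      i++;
    if (i > start && p->mysql_add_word(p, doc + start, i - start))
      return 1;
  }
  return 0;
}

/* Plugin side: remembers the original callback and the n-gram size. */
typedef struct {
  int (*orig_add_word)(toy_param *, const char *, int);
  int n;
} ngram_state;

/* Replacement callback: split each word into n-grams and forward them. */
static int ngram_add_word(toy_param *p, const char *word, int word_len)
{
  ngram_state *st = p->state;
  for (int i = 0; i + st->n <= word_len; i++)
    if (st->orig_add_word(p, word + i, st->n))
      return 1;
  return 0;
}

/* Plugin entry point: hook mysql_add_word, let the built-in parser
   extract the words, then restore the original callback. */
static int ngram_plugin_parse(toy_param *p, const char *doc, int doc_len, int n)
{
  ngram_state st = { p->mysql_add_word, n };
  assert(n > 0);
  p->state = &st;
  p->mysql_add_word = ngram_add_word;
  int rc = p->mysql_parse(p, doc, doc_len);
  p->mysql_add_word = st.orig_add_word;
  p->state = NULL;
  return rc;
}
```

In this toy model, parsing "simple slicer" with n = 2 leaves "si im mp pl le sl li ic ce er" in the index: the built-in parser found the two words, and the replaced callback turned each of them into bigrams.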
> Comments in include/mysql/plugin_ftparser.h do not mention that at
> all. That's why I initially thought that param->mysql_parse() would
> parse the string like the default parser does, without any way to
> interact with the process.
I've edited the comment to mention this possibility.
VP of MariaDB Server Engineering