maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #12426
Re: e3f45b2f9ea: MDEV-10267 Add ngram fulltext parser plugin
Hi, Rinat!
On Oct 21, Rinat Ibragimov wrote:
> > > +
> > > + // Avoid creating partial n-grams by stopping at word boundaries.
> > > + // Underscore is treated as a word character too, since identifiers in
> > > + // programming languages often contain them.
> > > + if (!is_alnum(char_type) && !is_underscore(cs, next, end)) {
> > > + start= next + char_len;
> > > + next= start;
> > > + n_chars= 0;
> > > + continue;
> >
> > I don't think it's a good idea. On the opposite, I would rather create
> > ngrams over word boundaries, so that, say, I'd split "n-gram plugin" to
> >
> > "n-g", "-gr", "gra", "ram", "am ", "m p", " pl", "plu", "lug", "ugi", "gin".
>
> I see how this can increase index size: more different strings are
> included. Implementation will become a tiny bit simpler. But can it
> make searching better somehow?
Yes, the idea is that it'll rank
"One can use n-gram plugin in MariaDB fulltext search"
higher than
"Plugins are useful for extending functionality of a program"
because some ngrams encode that "n-gram" and "plugin" words must be
together as a phrase.
> I can't decide. From my point of view, the current approach is fine.
> Please pick a variant, and I'll try to implement that.
No, I cannot guess which approach will produce more relevant searches.
Implement something and then we test what works better
> > > + }
> > > +
> > > + next += char_len;
> > > + n_chars++;
> > > +
> > > + if (n_chars == ngram_token_size) {
> > > + bool_info->position= static_cast<uint>(start - doc);
> >
> > nope, we don't have MYSQL_FTPARSER_FULL_BOOLEAN_INFO::position and this
> > plugin is definitely not the reason to break the API and add it.
>
> There is a patch in the patchset that adds the "position" field. Without
> that field,
> it's possible to trigger an assertion in InnoDB's FTS code by executing:
>
> INSTALL PLUGIN simple_parser SONAME 'mypluglib';
> CREATE TABLE articles(
> a TEXT DEFAULT NULL,
> b TEXT DEFAULT NULL,
> FULLTEXT (a, b) WITH PARSER simple_parser
> ) ENGINE=InnoDB;
> INSERT INTO articles VALUES ('111', '1234 1234 1234');
> DROP TABLE articles;
> UNINSTALL PLUGIN simple_parser;
>
> Same assertion may be triggered by any fts plugin. Are you sure you
> want that patch to be removed?
I don't think InnoDB assert is a reason to extend the fulltext plugin
API. What about this fix instead:
diff --git a/storage/innobase/fts/fts0fts.cc b/storage/innobase/fts/fts0fts.cc
--- a/storage/innobase/fts/fts0fts.cc
+++ b/storage/innobase/fts/fts0fts.cc
@@ -4669,7 +4669,7 @@ fts_tokenize_add_word_for_parser(
ut_ad(boolean_info->position >= 0);
position = boolean_info->position + fts_param->add_pos;
*/
- position = fts_param->add_pos;
+ position = fts_param->add_pos++;
fts_add_token(result_doc, str, position);
this fixed the crash for me.
> > > + n_chars= ngram_token_size - 1;
> > > + is_first= false;
> > > + }
> > > + }
> > > +
> > > + // If nothing was emitted yet, emit the remainder.
> >
> > what reminder, what is it for?
>
> It's "remainder", with "a" before "i". Remainder of the processed
> string. That way strings that are too short for having even a single
> n-gram, will still generate some index entries and will be
> discoverable by exact queries.
Perhaps the comment should say it? That if the string too short, it'll
be emitted here?
> > > + case MYSQL_FTPARSER_FULL_BOOLEAN_INFO:
> > > + // Reusing InnoDB's boolean mode parser to handle all special characters.
> > > + while (fts_get_word(cs, &start, end, &word, &bool_info)) {
> >
> > nope. InnoDB might not even be compiled it. Use param->mysql_parse()
>
> Copied fts_get_word() implementation into the plugin code.
>
> As far as I understand, param->mysql_parse() cannot be used here as it
> splits a string into words like default fulltext search does. This
> plugin however must generate n-grams.
Of course, it can. Note that fts_get_word() doesn't generate n-grams
either, it gets the whole word and the n-gram plugin later splits it
into n-grams. Similarly param->mysql_parse() will extract words for you
and you'll split them into n-grams.
Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx
References