maria-developers team mailing list archive

Thread
Date

Re: Performance problem: need index-based access when have only non-sargable conditions

To: Sergey Petrunya <psergey@xxxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxxx>
Date: Wed, 13 Oct 2010 15:12:09 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20101010075623.GB26074@pslp2>
User-agent: Mutt/1.5.16 (2007-06-09)

Hi, Sergey!

On Oct 10, Sergey Petrunya wrote:
> 
> I was asked about this performance problem: consider a query 
(so was I, so was Igor, and may be others :)

>   select * from tbl where tbl.key like '%foo%'
>   
> Table records are big, much larger than the length of the key. LIKE condition
> is very selective, however its pattern starts with '%', so we can't construct
> range access (and they are really searching for matches in word
> suffixes/infixes, so fulltext index isn't an option).
> 
> The desired execution strategy is:
> 
> - Scan the index on tbl.key and collect index tuples that match the LIKE 
>   condition.
> - For those index tuples, retrieve full records.
> 
> One can simulate it by rewriting the query as a self join:
> 
>   select * from tbl A, tbl B 
>   where 
>     B.primary_key = A.primary_key AND B.key LIKE '%foo%';
> 
> However, in this particular case they have a problem doing such rewrites
> because the queries are generated by some ORM tool.
> 
> Proposed solution from our side
> -------------------------------
> It is not difficult to add an execution strategy like this
> 
>   get_next_record()
>   {
>     read index tuple;
>     if (condition satisified)
>     {
>       read full record
>     }
>   }
> (with optional rowid-ordered retrieval for full records). 
> 
> The difficult part is to find a way to decided whether this strategy
> should be used.  We have no meaningful selectivity estimates for
> 'column LIKE '%foo%'.
> 
> The only way we could achieve this is by having the user manually
> specify condition selectivities. I can see one way to achieve this
> without getting into painful syntax modifications: 
> 
>   select * from tbl where selectivity(tbl.key like '%foo%', 0.05)
> 
> here 'selectivity(X,y)' will evaluate to value of X, but will also
> inform the optimizer that P(X is true)=y.

I don't like the manual selectivity() "function".
Igor was mentioning random dives to estimate the selectivity.
That could work.

Another thought - for InnoDB (and other engines that use primary key as
a rowid and convert table scan to an index scan) your execution strategy
is practically never worse than reading full rows right away. That is
for these engines we can always use your execution strategy without any
selectivity estimates.

Additionally, we could've make FORCE INDEX to force this strategy too -
not nice, but at least it would avoid syntax extensions. Besides, think
of it, it would be the "natural" behavior, as the user has already tried
FORCE INDEX to achieve this effect, that is, FORCE INDEX is expected to
work here.

So, the minimal change could be - always use this strategy for InnoDB
(and some other engines), never use it automatically for MyISAM (and
some other engines), always use it if FORCE INDEX was specified.

Regards,
Sergei

References

Performance problem: need index-based access when have only non-sargable conditions
From: Sergey Petrunya, 2010-10-10