maria-developers team mailing list archive

Thread
Date
Re: index_merge tasks summary

To: Sergey Petrunya <psergey@xxxxxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Wed, 3 Jun 2009 01:41:17 +0300
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20090527191723.GG12138@pslp>
Reply-to: monty@xxxxxxxxxxxx
Hi!

I cc:d this to the maria-developers list as there are probably others
that are interested in this discussion.
(I removed all references to the customer who is interested in getting
this solved)

>>>>> "Sergey" == Sergey Petrunya <psergey@xxxxxxxxxxxx> writes:

Sergey> Hi Monty,
Sergey> Please find below a summary of index_merge problems and suggestions on how
Sergey> we could address them, together with concerns:

Sergey> 1. index_merge/intersect is not used for range access scans. 
Sergey>    Described as WL#21. Seems to be a big task. 

I extended the high level definition a bit to make the text more clear.
Any preliminary estimate you could do (in weeks)?

Sergey>    One big issue is that wether we'll be able to make the optimizer make
Sergey>    good choices. There are two concerns
Sergey>    - Bad estimates/poor cost model
Sergey>    - Correlation(s). When we consider index_merge/intersect for 
Sergey>        t.key1<c1 AND t.key2<c2
Sergey>      we have no way to know whether the conditions are
Sergey>        = always satisfied together (and thus there is no point to use index_merge)
Sergey>        = never satisified together
Sergey>        = have no correlation
Sergey>      this will cause our estimates be inherently poor. Will they be
Sergey>      satisfactory? In case they won't, should we provision for adding hints
Sergey>      or something else?

I think that we should assume that there is always notable less rows to
retrieve when you can do an intersection than if we can't, if the
sizes of the sets are of same magnitude.

We should always do index merge (except if disabled by a hint) if the
number or rows matching the conditions are relatively small for all
involved keys.

After all, for normal size keys, we will get +100 keys for each key
block we read.  If we can eliminate a couple of rows for each block we
read we will probably gain speed.

I talked with Igor about having live statistics in memory for the
result of index merge.  If we could keep for each index combination
the number of rows found for each index and the number of rows
eliminated, we could quite soon know which index merge makes sense.
(This is an idea for the future).

I had some ideas to improve the temporary table solution; we can
discuss these separately.

Sergey> 2. Possible range access disables index_merge/[sort]union scans
Sergey>    Described as WL#24. I've posted a fix suggestion there. I'm not sure if 
Sergey>    it will work to customers complete satisfaction:
Sergey>     - whether the fix will handle all kinds of WHERE clauses he needs
Sergey>     - what will happen when we enable cost-based choice between range access
Sergey>       and index_merge/intersect (will there be poor plan choices due to
Sergey>       wrong cost calculations?)

If you can add to the worklog what kind of WHERE we will be able to
optimize with your sugestions, we can then ask the customer to verify
if that is enough for him as a first step.

Sergey> 3. index_merge/intersect optimization is poor.
Sergey>    Problems described in WL#26.  At the moment I have only vague idea which 
Sergey>    direction we need to move to improve it.
Sergey>    We could try grabbing low-hanging fruits, like 
Sergey>     = make sure index_merge/intersect is not picked when the range is 
Sergey>       available. 

Wouldn't this cause more problems like described in #2 ?

Please add some example WHERE clauses to the worklog to make it clear
exactly what you mean.

Sergey>     = change the the process of choosing best index_merge/intersect plan so
Sergey>       that it doesn't construct apparently useless (e.g. redundant) plans.

Do you mean disregarding some index from the index_merge plan that
are covered by other index?

I think this is an obvious fix to make early.

Sergey>    Or we could try re-working cost calculations so that the above is
Sergey>    automatically taken care of and doesnt happen. The problem with this is
Sergey>    that it's hard to estimate when we'll get at acceptable result then.

Lets start with the obvious and only when needed start thinking about
recalculating costs as these can easily lead to the some old working
queries are suddenly slower than before...

Sergey> It seems the first logical thing to do is #2. 

Agree.

Sergey> Then we could pick between #1 and #3. #1 and #3 are related in some way as
Sergey> index_merge/intersect can make more assumptions when it optimizes for ROR
Sergey> scans only. On the other hand, Customer mentioned that if we fix #1, #3 will
Sergey> have easier job as he'll be able to remove all the multi-column indexes he
Sergey> had to create to be able to get ROR scans for every WHERE clause he has.

Agree that fixing #1 is the next logical step.

Lets discuss all these worklogs on IRC on Thursday and then decide in
which order and how we should do things.

Regards,
Monty