maria-developers team mailing list archive

Thread
Date

Re: Regarding approach used for Early NULL filtering in the range optimizer(MDEV-15777)

To: Igor Babaev <igor@xxxxxxxxxxx>
From: Sergey Petrunia <sergey@xxxxxxxxxxx>
Date: Mon, 27 May 2019 19:01:53 +0300
Cc: Varun Gupta <varun.gupta@xxxxxxxxxxx>, maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <a7abcad3-46ae-6fae-8a82-ccf4987f6763@mariadb.com>
User-agent: Mutt/1.9.4 (2018-02-28)

On Fri, May 24, 2019 at 04:39:51PM -0700, Igor Babaev wrote:
> On 05/24/2019 01:41 PM, Sergey Petrunia wrote:
> > On Fri, May 24, 2019 at 10:08:04AM -0700, Igor Babaev wrote:
> > > On 05/24/2019 04:03 AM, Varun Gupta wrote:
> > > > Hi Igor,
> > > > After discussing with Sergey, we came up with these conclusions as to
> > > > why we used the approach of going through all the keyuses in the KEYUSE
> > > > array
> > > > 
> > > > Cases to consider:
> > > > we have an index on column a
> > > > 
> > > > 1) a OP const
> > > > where OP can be (</<=/=/between/ etc on which we do range analysis) .
> > > > This case is already handled by the range optimizer. We do not create a
> > > > NULL rejecting predicate for such conditions.
> > > > 
> > > > 2) eq_join conditions (tbl1.a = tbl2.col2)
> > > > This is the specific case we have tried to implement in MDEV-15777, we
> > > > create NULL rejecting predicates for the keyuses for each table and then
> > > > feed these predicates to range optimizer
> > > There is no logical explanation for checking null rejection of fields used
> > > in equalities with the help of array keyuse.
> > The logic here is as follows:
> > 
> > We already collect the set of KEYUSE::null_rejecting attributes. Its
> > meaning is very close to what the patch needs.
> The meaning of this argument is to be used a key value for ref access by an
> index i1.
> You need range access by a different index i2. There is no logical
> connection
> between these two indexes. The usage of i1 can be blocked by IGNORE clause
> and this case
> there won't be any keyuses for it. But the equality from which the
> null-rejection can be
> deduced is still there. You need range scan to reduce the number of records
> produced
> by the left operand of a join operation. And you should not care weather it
> should be the
> argument of any ref access.
> Here is the situation:
> The field of interest for range access is t1.a and you have two conditions
>    t1.a=t2.a with index i1 on t2.a
>    t1.b=t3.b with index i2 on t3.b
> The use prohibited to use i1. t1.a apparently remains null-rejecting. Yet it
> cannot be not detected
> as such by a look through array of keyuses.

I don't follow the logic. What MDEV-15777 does is to basically look at KEYUSE
object, which represent

  tbl.key_col = other_column

and feed this into the range optimizer:
  
  tbl.key_col IS NOT NULL

If the query CAN use a certain index:
- we will have a KEYUSE object for it
- we will feed the IS NOT NULL into the range optimizer.

if the query CANNOT use a certain index (because of IGNORE INDEX), then
- we won't have KEYUSE object for that index (and so won't construct an
  IS NOT NULL condition for it)
- range optimizer will not build ranges for it (so it doesn't really matter 
  if IS NOT NULL was constructed or not)

Can you provide a full example?

> > 
> > Because of this, the part of the patch for MDEV-15777 which analyzes the KEYUSE
> > array is very small: 122 lines including the comment (I counted
> > make_null_rejecting_conds() and add_cond()).
> > 
> > I think, the CPU overhead is small, too.
> > 
> > > > 3) a < func(b,c)
> > > > we do not handle this case, because:
> > > > 
> > > > a) It is harder to process. We would need to process the whole WHERE
> > > > clause to infer
> > > > non-nullability of the condition. Example:
> > > > 
> > > >    (t1.f2+1 < t1.f1+t1.f3) OR ... OR t1.f1 is null
> > > > 
> > > > here the left part of the OR condition doesn't allow f1 to be NULL, but
> > > > the
> > > > right condition allows it. We would also need to take into account tables
> > > > inside outer joins. We would need to go through Item classes and add code
> > > > which say "Item_xxx() will compute to false when its argument Y is null"
> > > > (there is item->not_null_tables() currently, but not not_null_columns()).
> > > Have you actually looked at the implementations of the virtual function
> > > not_null_tables()?
> > > Do you need me to provide the implementation of the virtual function
> > > not_null_columns()?
> > It will require multiple implementations. At the moment we have:
> > 
> > nm mysqld --demangle | grep 'Item.*::not_null_tables'  | wc -l
> > 21
> > 
> > Another question: do you intend to collect not_null_columns() for all columns,
> > or just columns that are a part of some index?
> > 
> > What would a good data structure to store a set of tablename.column_name ?
> It will be just a bitmap created for each table that can be used for many
> other purposes.
> If you want to filter out those that are not part of any index it can be
> easily done
> (even once for any table share)

My point was that constructing such a bitmap is a lot heavier than the
not_null_table_map().

table.h declares a field bitmap class:

  typedef Bitmap<MAX_FIELDS> Field_map;

but it's huge:

(gdb) p sizeof(Field_map)
  $52 = 544

and note that we will not be able to just put one Field_map into the TABLE
object:

If we think about how to compute not_null_columns() for Item_cond_and or
Item_cond_or:

- Item_cond_and will compute a union of all its arguments' not_null_columns().
- Item_cond_or will compute an intersection of all its arguments'
  not_null_columns().
- Other items have different logic.

that is, we will need to keep multiple "set of fields in tables" objects
around. Just one bitmap attached to the table will not suffice.

This is doable, but it's going to be more code and have more CPU overhead.

And I'm still not convinced why we should do this. 

It is not clear if this will fit the scope of MDEV-15777
and once we got it done, what's the use of it? Is this something other query
optimizers do (so the users expect it?)?

> > 
> > > > b) the conditions in form of arbitrary functions are not as frequently
> > > > used as
> > > > ref access conditions.
> > > > 
> > > > to sum up a) and b) - doable but will require a lot of effort.
> > > > 
> > > > Do you have any specific practically relevant example in mind that we
> > > > should
> > > > handle?

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog

References

Regarding approach used for Early NULL filtering in the range optimizer(MDEV-15777)
From: Varun Gupta, 2019-05-24
Re: Regarding approach used for Early NULL filtering in the range optimizer(MDEV-15777)
From: Sergey Petrunia, 2019-05-24