← Back to team overview

maria-developers team mailing list archive

Benchmarking index_merge sort_intersect

 

Hi!

Please find below results of my attempts to figure out
if/when one should "index_merge_sort_intersection=on" setting.

General setup information:
* Experiment was done on my laptop
* I picked "on time flight statistics" dataset, because 
 - it has lots of columns, so one can't be expected to have all possible composite indexes
 - it allows for meaningful intersection queries
 - the data is real and I don't expect it to be extremely correlated (although some
   aiports/airlines are probably worse than others, so it's not totally
   uncrellated, either).
* Storage engine is XtraDB
* I loaded data for Q1 2009, which is 1.5M rows, and ontime.ibd file is 704M:
MariaDB [ontime]> select count(*), min(FlightDate), max(FlightDate) from ontime;
+----------+-----------------+-----------------+
| count(*) | min(FlightDate) | max(FlightDate) |
+----------+-----------------+-----------------+
|  1578171 | 2009-01-01      | 2009-03-31      |
+----------+-----------------+-----------------+
1 row in set (15.92 sec)

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on';

MariaDB [ontime]> explain select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA');
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key                  | key_len | ref  | rows  | Extra                                                   |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3     | NULL | 41832 | Using sort_intersect(DepDelay,OriginState); Using where |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
1 row in set (0.01 sec)

## (Run query several times until query time stabilizes) 

MariaDB [ontime]> select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA');
+----------+
| count(*) |
+----------+
|    18824 |
+----------+
1 row in set (2.37 sec)

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off';

MariaDB [ontime]> explain select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA');
+----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key      | key_len | ref  | rows   | Extra                                         |
+----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay | 5       | NULL | 201894 | Using index condition; Using where; Using MRR |
+----+-------------+--------+-------+----------------------+----------+---------+------+--------+-----------------------------------------------+
1 row in set (0.01 sec)

## (Run query several times until query time stabilizes) 

MariaDB [ontime]> select count(*) from ontime where depdelay > 30 and OriginState IN ('WA', 'CA');
+----------+
| count(*) |
+----------+
|    18824 |
+----------+
1 row in set (18.27 sec)

## For comparison, without indexes whatsoever: 

MariaDB [ontime]> select count(*) from ontime ignore index (depdelay, originstate) where depdelay > 30 and OriginState IN ('WA', 'CA');
+----------+
| count(*) |
+----------+
|    18824 |
+----------+
1 row in set (15.50 sec)

"iostat -x" showed disk utilization as 0% during all of the above listed
queries, so all accessed data was either in cache or in buffer pool.

For an idea about predicate selectivities and correlation:
                               ROWS       FRACTION
whole table                   1578171      100.0%
depdelay>30                    151806        9.6%
OriginState IN ('WA', 'CA')    206576       13.1% 
both                            18824        1.1%

Conclusions:
 - sort-intersect gives about 7x improvement
 - Igor's statement that sort-intersect can't be useful when the load is not
   IO-bound is not supported by the experiment.


My next step was to try with narrower ranges.

I changed the predicates to
 - OriginState IN ('IL', 'VW') (100K rows instead of 200K we had with WA+CA)
 - depdelay > 60   (75 K rows instead of 150K we had with depdelay>30)

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on';

MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60  and OriginState IN ('IL', 'VT');
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key                  | key_len | ref  | rows  | Extra                                                   |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3     | NULL | 10105 | Using sort_intersect(DepDelay,OriginState); Using where |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
1 row in set (0.02 sec)

# Run a few times to stabilize 

MariaDB [ontime]> select count(*) from ontime where depdelay > 60  and OriginState IN ('IL', 'VT');
+----------+
| count(*) |
+----------+
|     6283 |
+----------+
1 row in set (0.99 sec)

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off';
Query OK, 0 rows affected (0.00 sec)

MariaDB [ontime]> explain select count(*) from ontime where depdelay > 60  and OriginState IN ('IL', 'VT');
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key      | key_len | ref  | rows  | Extra                                         |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay | 5       | NULL | 99636 | Using index condition; Using where; Using MRR |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+
1 row in set (0.02 sec)

# Run a few times to stabilize 

MariaDB [ontime]> select count(*) from ontime where depdelay > 60  and OriginState IN ('IL', 'VT');
+----------+
| count(*) |
+----------+
|     6283 |
+----------+
1 row in set (8.63 sec)

MariaDB [ontime]> select count(*) from ontime ignore index (OriginState,DepDelay) where depdelay > 60  and OriginState IN ('IL', 'VT');
+----------+
| count(*) |
+----------+
|     6283 |
+----------+
1 row in set (14.00 sec)

The picture is similar: sort-intersect is still about 8 times better than
single-index range access plan.
I'm not sure about the conclusions. I suspect that there is some threshold 
(#records to be read through the most selective index)  after which 
sort-intersect becomes worse than using one of the indexes. I wasn't able to 
hit it, though.

Then I've tried changing the OriginState predicate to being simple equality, to
see if we could compare againist ref access:

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=on';

MariaDB [ontime]> explain select count(*) from ontime  where depdelay > 60  and OriginState='IL';

+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key                  | key_len | ref  | rows  | Extra                                                   |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3     | NULL | 10043 | Using sort_intersect(DepDelay,OriginState); Using where |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+

MariaDB [ontime]> select count(*) from ontime  where depdelay > 60  and OriginState='IL';
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (1.00 sec)

MariaDB [ontime]> set optimizer_switch='index_merge_sort_intersection=off';

MariaDB [ontime]> explain select count(*) from ontime  where depdelay > 60  and OriginState='IL';
+----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+
| id | select_type | table  | type | possible_keys        | key         | key_len | ref   | rows   | Extra                              |
+----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+
|  1 | SIMPLE      | ontime | ref  | OriginState,DepDelay | OriginState | 3       | const | 201600 | Using index condition; Using where |
+----+-------------+--------+------+----------------------+-------------+---------+-------+--------+------------------------------------+
1 row in set (0.01 sec)

MariaDB [ontime]> select count(*) from ontime  where depdelay > 60  and OriginState='IL';
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (0.95 sec)

And the benefit was gone. sort_intersect happens to be just as fast as ref
access.

I've changed OriginState='IL' to  OriginState IN ('IL','ZZ'), where 'ZZ' is the
non existant state code:


MariaDB [ontime]> explain select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key      | key_len | ref  | rows  | Extra                                         |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay | 5       | NULL | 99636 | Using index condition; Using where; Using MRR |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+-----------------------------------------------+

The used index has changed, 

MariaDB [ontime]> select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (9.10 sec)

And the query time went up greatly.  Using FORCE INDEX to use the index that
ref access was using improved the situation slightly:

MariaDB [ontime]> explain select count(*) from ontime force index(originstate) where depdelay > 60  and OriginState IN ('IL','ZZ');
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+
| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows   | Extra                                         |
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState   | OriginState | 3       | NULL | 201601 | Using index condition; Using where; Using MRR |
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+-----------------------------------------------+

MariaDB [ontime]> select count(*) from ontime force index(originstate) where depdelay > 60  and OriginState IN ('IL','ZZ');
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (2.69 sec)

Disabling MRR further improved the situation: 

MariaDB [ontime]> set optimizer_use_mrr='disable';
Query OK, 0 rows affected (0.00 sec)

MariaDB [ontime]> explain select count(*) from ontime force index(originstate) where depdelay > 60  and OriginState IN ('IL','ZZ');
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+
| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows   | Extra                              |
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState   | OriginState | 3       | NULL | 201601 | Using index condition; Using where |
+----+-------------+--------+-------+---------------+-------------+---------+------+--------+------------------------------------+
1 row in set (0.01 sec)

MariaDB [ontime]> select count(*) from ontime force index(originstate) where depdelay > 60  and OriginState IN ('IL','ZZ');
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (1.23 sec)


But it seems we can't put all the blame on MRR. The supposedly best query plan
without MRR is just as bad as it was with MRR:

MariaDB [ontime]> explain select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+
| id | select_type | table  | type  | possible_keys        | key      | key_len | ref  | rows  | Extra                              |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay | 5       | NULL | 99636 | Using index condition; Using where |
+----+-------------+--------+-------+----------------------+----------+---------+------+-------+------------------------------------+
1 row in set (0.02 sec)

MariaDB [ontime]> select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (8.96 sec)

In any case, sort_intersect doesn't care about whether one of the predicates is
an equality or IN:

MariaDB [ontime]>  set optimizer_switch='index_merge_sort_intersection=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [ontime]> explain select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
| id | select_type | table  | type  | possible_keys        | key                  | key_len | ref  | rows  | Extra                                                   |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
|  1 | SIMPLE      | ontime | range | OriginState,DepDelay | DepDelay,OriginState | 5,3     | NULL | 10043 | Using sort_intersect(DepDelay,OriginState); Using where |
+----+-------------+--------+-------+----------------------+----------------------+---------+------+-------+---------------------------------------------------------+
1 row in set (0.02 sec)

MariaDB [ontime]> select count(*) from ontime  where depdelay > 60  and OriginState IN ('IL','ZZ');
+----------+
| count(*) |
+----------+
|     6215 |
+----------+
1 row in set (0.97 sec)

Conclusions:
I'm not sure if the "failure" to demonstrate speedup against ref access is a
genuine symptom (I think - not). 

Perhaps, enabling all of our improvements at once (BKA, DS-MRR, sort-intersect)
will produce more coherent results, because all of these strategies use
rowid-ordered disk-sweep reads, which allows to make fairer comparison between
them.



BR
 Sergey
-- 
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog



Follow ups