
maria-developers team mailing list archive

Re: GSoC 2013


Hi, Dammina!

On Apr 17, Dammina Sahabandu wrote:
> Hi All,
> I'm Dammina Sahabandu, a 3rd-year undergraduate of the University of
> Moratuwa (Dept. of Computer Science and Engineering), which has won
> GSoC's award for the most participants every year from 2009 to 2012.
> I'm willing to contribute to a MariaDB project in this year's GSoC. I
> went through the project ideas page and found the project "Self-Tuning
> Optimizer" interesting. Can you give me some insight into the project?

Great! I hope you've read https://mariadb.atlassian.net/browse/MDEV-350

So, this is the idea. The optimizer code (in the sql_select.cc, for
example) contains lines like

        best_time= tmp + records/(double) TIME_FOR_COMPARE;

(where TIME_FOR_COMPARE is simply 5) or

        double keys_per_block= (stats.block_size/2.0/len+1);

or

        (stats.data_file_length) / IO_SIZE + 2

(where IO_SIZE is 4096).

These magic constants are not universal, they should be different for
different engines, different for SSD, etc. This task is about replacing
them all with variables, like

        best_time= tmp + records * time_for_compare;
        tmp= engine_factor * table->file->keyread_time(...);

Then we measure the query execution time (T), the number of rows read
from every table, etc. And we split the time, approximately, between
those operations: say, t1 seconds were spent reading N1 rows from the
first table, t2 seconds reading N2 rows from the second table, and t3
seconds doing N1*N2 comparisons. And we have the cost estimates, C1 and
C2, which the optimizer requests from the engine before executing the
query.

We assume that the engine estimates are perfect - that is, proportional
to the actual time spent on retrieval.

  t1 = engine_factor * C1
  t2 = engine_factor * C2
  t3 = N1 * N2 * time_for_compare
  T = t1 + t2 + t3

One cannot solve the above system from a single query - there are too
many unknowns (the lowercase variables) - but we can collect the data
from more than one query.
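For illustration: with a single engine_factor the model is linear in the
unknowns, T = engine_factor * (C1 + C2) + time_for_compare * (N1 * N2),
so observations from several queries can be fitted with ordinary least
squares. A minimal sketch of that idea (the struct and function names
here are hypothetical, not actual MariaDB code):

```cpp
#include <vector>

// One observation per executed query: the summed engine cost
// estimates (c = C1 + C2), the number of comparisons (n = N1 * N2),
// and the measured wall-clock time t = T.
struct Observation { double c, n, t; };

// Least-squares fit of  t ~ engine_factor * c + time_for_compare * n
// by solving the 2x2 normal equations directly.
// Returns false if the observations are degenerate (e.g. all queries
// have proportional c and n, so the two factors cannot be separated).
static bool fit_factors(const std::vector<Observation> &obs,
                        double *engine_factor, double *time_for_compare)
{
  double scc= 0, snn= 0, scn= 0, sct= 0, snt= 0;
  for (const Observation &o : obs)
  {
    scc+= o.c * o.c;
    snn+= o.n * o.n;
    scn+= o.c * o.n;
    sct+= o.c * o.t;
    snt+= o.n * o.t;
  }
  double det= scc * snn - scn * scn;
  if (det == 0.0)
    return false;
  *engine_factor=    (sct * snn - snt * scn) / det;
  *time_for_compare= (snt * scc - sct * scn) / det;
  return true;
}
```

With per-engine factors the same approach still works - each engine's
factor just becomes one more column in the regression.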

This way we collect statistics and extract values for engine_factor
(which is different for different storage engines) and time_for_compare
(which is the same for the whole server). And we feed these values
back to the optimizer, which can use them to generate more adequate
execution plans.

This is a continuous process - statistics are always collected while the
server is running, and the optimizer constants are periodically updated.
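One simple way to do such a periodic update - sketched below with a
hypothetical helper, not the actual implementation - is to blend each
freshly fitted value into the constant the optimizer currently uses,
e.g. with an exponential moving average, so one noisy measurement
cannot make the query plans jump:

```cpp
// Move the optimizer constant a fraction `alpha` of the way towards
// the freshly fitted value.  Small alpha gives smooth, slow
// adaptation; alpha = 1.0 would adopt the new fit immediately.
static double update_constant(double current, double fitted,
                              double alpha)
{
  return current + alpha * (fitted - current);
}
```

For example, seeding with the old magic constant 5 and fitting 3.0 from
measurements, update_constant(5.0, 3.0, 0.3) moves the value to 4.4;
repeated updates converge towards whatever the measurements support.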