
maria-developers team mailing list archive

Eliminating or replacing the InnoDB row_prebuilt_t::fetch_cache


The handler API in MySQL and MariaDB operates on a single record at a
time. There are some ‘batched’ primitives that, to my understanding, were
developed for NDB Cluster, but InnoDB generally does not use them.

The only nontrivial things that InnoDB does during reads are Index
Condition Pushdown (ICP) and the end-of-range detection (DsMrr) that was
ported from the never-released MySQL 6.0 or 5.2 to MySQL 5.6.

To improve the performance of range scans, InnoDB uses a caching mechanism
to prefetch results. Starting with the 4th consecutive read on the same
ha_innobase handle, InnoDB would prefetch 8 records into
row_prebuilt_t::fetch_cache. There are several drawbacks of this:

   - The parameters of this cache are hard-coded, and these caches can make
   wrong guesses (for SELECT * FROM t LIMIT 5, it would unnecessarily read 8
   records, possibly causing unnecessary page accesses)
   - The caches waste memory, especially on partitioned tables, where each
   partition handle would have its own cache. (The main benefit of ha_innopart
   or InnoDB native partitioning in MySQL 5.7 is that the fetch_cache of the
   table is shared between partitions. It is only a benefit of saving memory,
   because the cache will be emptied when switching partitions.)

As far as I understand, the entire purpose of the prebuilt->fetch_cache is
to reduce the cost of acquiring the page latch and repositioning the cursor
on each read request. If we did not call mtr_t::commit() between each
request, maybe we could remove this cache altogether.

When I was in the InnoDB team at MySQL, I brought up the idea of
eliminating the fetch_cache a few times, but the optimizer developers were
in a separate team, and doing this would have needed support from both
managers. I am hoping that with MariaDB this would merely be a technical
challenge: first write a proof-of-concept prototype, then polish and test
it.

When I discussed this with the late Olav Sandstå some years ago, there was
some thinking that we should stop hard-coding TABLE::record[0] in various
parts of the code, so that the storage engine would be able to return
multiple records into a larger TABLE::record[] buffer. I wonder if a
simpler approach would work, and if we could do without any prefetching:

   1. Remove row_sel_dequeue_cached_row_for_mysql(), prebuilt->fetch_cache,
   and related code.
   2. Add prebuilt->mtr and let it be reused across multiple subsequent
   calls to row_search_mvcc(), until the query executor stops the current read
   batch.
   3. A call to btr_pcur_store_position() and mtr_t::commit() will be
   necessary before any potentially long delay, such as when the result is
   being sent to a client connection, or when the query executor is going to
   switch to read records from another table.
   4. The "end of read batch" could be signalled by invoking
   handler::extra() with a parameter.

With such changes, less copying would take place, and there should be less
contention on dict_index_t::lock and the upper-level buf_block_t::lock, due
to fewer calls to btr_pcur_restore_position(). While that function does
have a fast path ("optimistic restore"), it cannot switch to the next page
without a key lookup. Range scans would be more efficient if we could
always simply advance to the next (right-hand) leaf page and release the
lock on the previous page.

I would love to do this in InnoDB. But I think that the effort should be
led from the optimizer side. Someone should first implement the "end of
batched read" calls and some DBUG_ASSERT() checks that would catch any
wrongdoing. That is, whenever we start to send data to a client, start to
write into a temporary table, or switch to reading records from another
cursor, we must assert that the "end of batched read" flag is set on all
cursor handles. Similarly, at the start and end of a statement, and at
commit or rollback, all cursor handles must be at "end of batched read".

Who would take the challenge?

Best regards,

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation

DON’T MISS

M|18

MariaDB User Conference

February 26 - 27, 2018

New York City

https://m18.mariadb.com/