maria-developers team mailing list archive

Re: Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE

To: Sachin <sachin.setiya@xxxxxxxxxxx>
From: Nikita Malyavin <nikitamalyavin@xxxxxxxxx>
Date: Sat, 22 May 2021 16:03:31 +0300
Cc: "MariaDB Developers (maria-developers@xxxxxxxxxxxxxxxxxxx)" <maria-developers@xxxxxxxxxxxxxxxxxxx>, Andrei Elkin <andrei.elkin@xxxxxxxxxxx>
In-reply-to: <87pmxmn1a1.fsf@quad>

Welcome to the thread, Andrei!

For everybody: we just had a productive chat with Andrei, and I'd like to
outline the results.


Let's take a look at binlog_commit, or more precisely at
MYSQL_BIN_LOG::trx_group_commit_leader, which is called from there
(follow the binlog_commit_flush_trx_cache
 -> THD::binlog_flush_pending_rows_event
 -> MYSQL_BIN_LOG::flush_and_set_pending_rows_event
 -> MYSQL_BIN_LOG::write_transaction_to_binlog call chain).

Committing is currently strictly sequenced. Transactions are organized
into groups, and once the last transaction of a group is acknowledged in
the first commit phase, the leader commits them all in a chosen order.

However, even here we can write the transactions in parallel while
preserving the order.
Andrei also claims that the order can potentially be restored on the
replication side.
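
To illustrate what "in parallel, preserving the order" could mean, here is
a minimal sketch. It is not MariaDB code: write_group_in_parallel is a
hypothetical helper, and I assume that the size of every transaction is
known before the copy starts. The offsets are reserved sequentially in
commit order, and only the copying itself runs concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not part of MariaDB: copies each transaction payload
// into one output buffer at an offset fixed by its place in the commit order.
std::string write_group_in_parallel(const std::vector<std::string> &trx_in_commit_order)
{
  // Step 1 (sequential, cheap): reserve a disjoint region per transaction.
  std::vector<size_t> offsets(trx_in_commit_order.size());
  size_t total = 0;
  for (size_t i = 0; i < trx_in_commit_order.size(); i++)
  {
    offsets[i] = total;
    total += trx_in_commit_order[i].size();
  }

  std::string out(total, '\0');
  char *base = &out[0];  // taken once, before any writer thread starts

  // Step 2 (parallel): every writer fills only its own region, so the
  // final byte order equals the commit order regardless of scheduling.
  std::vector<std::thread> writers;
  for (size_t i = 0; i < trx_in_commit_order.size(); i++)
    writers.emplace_back([&trx_in_commit_order, &offsets, base, i] {
      const std::string &src = trx_in_commit_order[i];
      std::memcpy(base + offsets[i], src.data(), src.size());
    });
  for (std::thread &t : writers)
    t.join();

  assert(out.size() == total);
  return out;
}
```

The sequential part is only a prefix sum over the sizes, so the expensive
copy is what gets parallelized.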

In any case, we technically can't send a transaction to the replication
slave before the binlog flush and fsync. Nevertheless, the data will
still be held in the volatile append cache.

There was MDEV-20925 <https://jira.mariadb.org/browse/MDEV-20925> to store
the transaction length in the event, but unfortunately it was rejected:
> I will be closing this issue because we have COMMIT/ROLLBACK query log
> event in the end of transaction , whose size is difficult to determine ,
> So current plan is to do MDEV-19687 without transaction length.

The replication team decided to calculate the transaction sizes while
receiving the data from the IO thread and to store them in a hash, so
that no protocol modifications would be required.
I suggested buffering each transaction separately and then pushing it
into the relay log as a data frame that stores the length.
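
A rough sketch of the data frame idea follows. The layout (a 4-byte
little-endian length followed by the payload) and the names
frame_transaction and read_frame are my assumptions for illustration
only; this is not the relay log format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical frame layout (an assumption, not the relay log format):
// 4-byte little-endian payload length, followed by the payload itself.
std::string frame_transaction(const std::string &trx)
{
  assert(trx.size() <= UINT32_MAX);
  const uint32_t len = static_cast<uint32_t>(trx.size());
  std::string frame;
  for (int shift = 0; shift < 32; shift += 8)
    frame.push_back(static_cast<char>((len >> shift) & 0xFF));
  frame += trx;
  return frame;
}

// Reads one frame starting at `pos` and advances `pos` past it.
// Returns false (leaving `pos` untouched) if no complete frame is available.
bool read_frame(const std::string &buf, size_t &pos, std::string &trx)
{
  if (pos > buf.size() || buf.size() - pos < 4)
    return false;
  uint32_t len = 0;
  for (int i = 0; i < 4; i++)
    len |= static_cast<uint32_t>(static_cast<unsigned char>(buf[pos + i])) << (8 * i);
  if (buf.size() - pos - 4 < len)
    return false;
  trx.assign(buf, pos + 4, len);
  pos += 4 + static_cast<size_t>(len);
  return true;
}
```

With the length stored up front, a reader can skip to the next
transaction without parsing the events inside the current one.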

Then, we have MDEV-19687 <https://jira.mariadb.org/browse/MDEV-19687>,
which was the supertask for MDEV-20925
<https://jira.mariadb.org/browse/MDEV-20925>.

The aim is to parallelize replication on the slave side. They are
implementing their own parallel circular buffer for a single-writer,
multiple-reader use case: the parallel workers are going to pick out the
transaction data and resolve the commit order later.
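
For readers unfamiliar with the pattern, here is a deliberately
simplified sketch of such a buffer. It is not the MDEV-19687 design: the
class name TrxRing is hypothetical, and I use a plain mutex where a real
implementation would presumably use something finer-grained. One writer
pushes whole transactions, and any number of workers pop them.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bounded ring of transactions: one producer calls try_push,
// several workers call try_pop. A single mutex keeps the sketch simple.
class TrxRing
{
  std::vector<std::string> slots;
  size_t head = 0;   // index of the oldest filled slot
  size_t count = 0;  // number of filled slots
  std::mutex lock;

public:
  explicit TrxRing(size_t capacity) : slots(capacity)
  {
    assert(capacity > 0);
  }

  // Returns false when the ring is full; the writer has to retry later.
  bool try_push(std::string trx)
  {
    std::lock_guard<std::mutex> guard(lock);
    if (count == slots.size())
      return false;
    slots[(head + count) % slots.size()] = std::move(trx);
    count++;
    return true;
  }

  // Returns false when the ring is empty. Each transaction is handed to
  // exactly one worker, in the order in which it was pushed.
  bool try_pop(std::string &trx)
  {
    std::lock_guard<std::mutex> guard(lock);
    if (count == 0)
      return false;
    trx = std::move(slots[head]);
    head = (head + 1) % slots.size();
    count--;
    return true;
  }
};
```

Workers pick transactions up in push order but may finish them in any
order, which is why the commit order has to be resolved afterwards.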

Sergey Petrunia, you asked:

> I see SEQ_READ_APPEND is only used for Relay Log on the slave. Afaiu, the
> relay log has only one producer, the network io thread.


Andrei clarified that there can actually be multiple sources. However,
they are just going to create a work queue for each source.


===============================================================
To summarize, the rationale for this IO_CACHE improvement has changed:

* The replication relay log is not the use case anymore, since a separate,
much simplified circular buffer is going to be implemented (AFAIU, Sachin
is in the middle of working on
MDEV-19687 <https://jira.mariadb.org/browse/MDEV-19687>).

* Binlog commit, on the other hand, is a good use case! Most likely I
forgot about it when I was writing the rationale. In any case, it wasn't
clear to me whether the transactions can be written in parallel there.
 On the reader side, the network replication sender should not block the
commit process while reading from the binlog.



Regards,
Nikita

References

Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
From: Nikita Malyavin, 2021-05-18
Re: Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
From: Sergey Petrunia, 2021-05-19
Re: Request for a discusison: A fine-grained concurrent ring buffer mode for IO_CACHE
From: Nikita Malyavin, 2021-05-19