maria-developers team mailing list archive

Re: prospective GSOC 2017 student [MDEV-7502]

To: ibrar arshad <ibrararshad80@xxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Sun, 19 Mar 2017 18:53:35 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CACPHCEJ70sCw7HNnspo+sGHWY2vzq-nngHLTq_iNgOD5NvN3aQ@mail.gmail.com>
User-agent: Mutt/1.5.24 (2015-08-30)

Hi, Ibrar!

On Mar 19, ibrar arshad wrote:
> Hi,
> 
> My name is Ibrar Arshad and I am interested in working on the task of
> automatic slave provisioning (ticket: MDEV-7502
> <https://jira.mariadb.org/browse/MDEV-7502>) during GSOC 2017. I have read
> the summary on the ticket and have achieved a fair understanding of the
> problem and I am working towards ironing out the implementation details.
> The use-case as I understand is that we want the slave to auto-replicate
> the data from master once pointed to the master

Yes.

> and we want to do it in such a manner that the binlog events from
> current master position as well as the old data chunks are relayed to
> the slave in a parallel fashion.

Not necessarily. There could be other approaches too.

Maybe even bulk-loading the data would be faster than sending data in
chunks and applying events in parallel. Or maybe not.

> I have a few questions related to the proposal:
> 
>    1. After reading a few pages on replication, my understanding is
>    that after "CHANGE MASTER TO" and "START SLAVE", master starts
>    sending binlog events from its current position to the slave which
>    slave starts applying. The usual replication approach is to get the
>    current binlog position on master, backup all the data till this
>    position from master to slave, point slave to this position (or
>    GTID) via "CHANGE MASTER TO", and START SLAVE to start replicating
>    bin events from master. But for MDEV-7502, we want the normal
>    events and old data chunks to be transmitted in parallel.

The main thing we want for MDEV-7502 is to avoid the step of "backup all
the data... restore on the slave".

>    The ticket summary mentions using separate domain_ids to send the
>    new and old data in parallel, does there exist a way to do so
>    currently? How can domain id be used here? Can we currently point
>    the slave to 2 different bin positions on a single master and
>    expect the master to send events from both positions?  Or will this
>    require some sort of new process/thread implementation on master to
>    do so?

No, this won't. I didn't actually try to connect twice from a slave to
the same master, but I suspect it'll either work or can be fixed to work
rather easily.
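
A rough sketch of what "connecting twice" could look like, assuming
MariaDB's named (multi-source) connections are used. The connection
names, host, and user are hypothetical, and this is untested against a
single master; the helper only builds the statements and runs nothing.

```python
def connect_twice_sql(host, user, names=("live", "provision")):
    """Build CHANGE MASTER / START SLAVE statements for two named
    connections that point at the same master.

    Assumption: the connection names are made up for illustration, and
    whether one master accepts both connections from a single slave is
    exactly the open question above.
    """
    stmts = []
    for name in names:
        # One named connection per stream (multi-source syntax).
        stmts.append(
            f"CHANGE MASTER '{name}' TO "
            f"MASTER_HOST='{host}', MASTER_USER='{user}';"
        )
        stmts.append(f"START SLAVE '{name}';")
    return stmts


if __name__ == "__main__":
    for stmt in connect_twice_sql("master.example", "repl"):
        print(stmt)
```

Running the printed statements on a test slave would show quickly
whether the second connection is accepted or needs a fix.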

>    2. There are at least two other approaches mentioned in the
>    ticket's comments section. It doesn't seem that a single
>    approach has been finalized. This project doesn't seem to have a
>    mentor yet to provide guidance so which approach should an
>    applicant pursue further?

Yes, the project suggests a few different approaches. You can discuss
them in your proposal and suggest the one you think is the best.
There will be a mentor, don't worry. It just hasn't been formally
assigned yet.

> I would like to discuss the project approaches and implementation
> further in detail before submitting a proposal so can somebody please
> answer my queries and further suggest pointers to this project
> specific material which I can go through to get a deeper
> understanding? Thanks.

Hmm..

For example, I've mentioned above that it's not clear whether sending
all data first and bulk-loading it will be faster or slower than
interleaving data and RBR binlog events.

You can test it. Get a big table dump (not huge, but something that
takes a noticeable amount of time to load). Then get a bunch of
single-row updates/deletes.
And try 1) load the dump, then do the updates; 2) do the updates in
parallel with the dump. Just take care to enable at least the primary
key, and make sure that in both approaches you get the same table
content at the end.
That's a simple test, no coding involved, but it'll give some
understanding as to which approach is faster on the slave side.
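
If you'd rather script the comparison, it could be sketched roughly as
below. This is only an illustration under loud assumptions: it uses an
in-memory SQLite table as a stand-in for the slave (so the timings say
nothing about MariaDB), the dump and the changes are synthetic, and the
chunk size and all function names are hypothetical. In the interleaved
run each change is applied only after the chunk holding its row has
been loaded, which is what keeps the final table content identical.

```python
import sqlite3
import time
from collections import defaultdict

CHUNK = 1000  # hypothetical chunk size, in rows


def make_dump(n_rows):
    """Synthetic 'dump': (id, val) rows in primary-key order."""
    return [(i, f"row-{i}") for i in range(n_rows)]


def make_changes(n_rows, n_changes):
    """Deterministic single-row updates/deletes (stand-ins for RBR events)."""
    changes = []
    for k in range(n_changes):
        row_id = (k * 7919) % n_rows
        if k % 3 == 0:
            changes.append(("delete", row_id, None))
        else:
            changes.append(("update", row_id, f"changed-{k}"))
    return changes


def new_table():
    con = sqlite3.connect(":memory:")
    # The primary key is enabled from the start, as the test requires.
    con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    return con


def load_chunk(con, rows):
    con.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)


def apply_change(con, change):
    kind, row_id, val = change
    if kind == "delete":
        con.execute("DELETE FROM t WHERE id = ?", (row_id,))
    else:
        con.execute("UPDATE t SET val = ? WHERE id = ?", (val, row_id))


def snapshot(con):
    return con.execute("SELECT id, val FROM t ORDER BY id").fetchall()


def load_then_update(dump, changes):
    """Approach 1: load the whole dump, then apply every change."""
    con = new_table()
    t0 = time.perf_counter()
    for start in range(0, len(dump), CHUNK):
        load_chunk(con, dump[start:start + CHUNK])
    for change in changes:
        apply_change(con, change)
    con.commit()
    elapsed = time.perf_counter() - t0
    return snapshot(con), elapsed


def interleaved(dump, changes):
    """Approach 2: after each chunk, apply the changes for its rows."""
    by_chunk = defaultdict(list)
    for change in changes:
        by_chunk[change[1] // CHUNK].append(change)
    con = new_table()
    t0 = time.perf_counter()
    for start in range(0, len(dump), CHUNK):
        load_chunk(con, dump[start:start + CHUNK])
        for change in by_chunk[start // CHUNK]:
            apply_change(con, change)
    con.commit()
    elapsed = time.perf_counter() - t0
    return snapshot(con), elapsed


if __name__ == "__main__":
    dump = make_dump(20000)
    changes = make_changes(20000, 5000)
    a, ta = load_then_update(dump, changes)
    b, tb = interleaved(dump, changes)
    assert a == b, "both approaches must end with the same table content"
    print(f"load then update: {ta:.3f}s, interleaved: {tb:.3f}s")
```

For real numbers the same two runs would have to be done against an
actual slave with a real dump; the script only shows the shape of the
comparison and the equality check at the end.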

Regards,
Sergei
Chief Architect MariaDB
and security@xxxxxxxxxxx

Follow ups

Re: prospective GSOC 2017 student [MDEV-7502]
From: Stephane Varoqui, 2017-03-19

References

prospective GSOC 2017 student [MDEV-7502]
From: ibrar arshad, 2017-03-19