maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication


Sergei Golubchik <serg@xxxxxxxxxxxx> writes:

> Hi, Kristian!

Hi, thanks for your comments! A couple of questions inline, and some

> On Jun 24, Kristian Nielsen wrote:

>> At the implementation level, a lot of the work is basically to pull
>> out all of the needed information from the THD object/context. The API
>> I propose tries to _not_ expose the THD to consumers. Instead it
>> provides accessor functions for all the bits and pieces relevant to
> of course
>> each replication event, while the event class itself likely will be
>> more or less just an encapsulated THD.
>> So an alternative would be to have a generic event that was just
>> (type, THD).  Then consumers could just pull out whatever information
>> they want from the THD. The THD implementation is already exposed to
>> storage engines. This would of course greatly reduce the size of the
> no, it's not. THD is not exposed to engines (unless they define
> MYSQL_SERVER but then it's not our problem), they use accessor
> functions.

Ah, I see. Ok good, so it makes sense to use accessor functions in the
replication APIs also, with no trace of THD.
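
A minimal sketch of what "accessor functions with no trace of THD" could look like for one event type. The names session_internal, get_query_length() and get_server_id() are hypothetical and only for illustration; session_internal stands in for the server-internal THD, which a real API header would only forward-declare.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the server-internal session object (THD in the
// real server). A consumer-facing header would only forward-declare it.
class session_internal
{
public:
  std::string query;
  unsigned long server_id;
};

// The event is little more than an encapsulated session pointer; consumers
// only ever see the accessor functions.
class rpl_event_statement_query
{
public:
  explicit rpl_event_statement_query(const session_internal *s) : sess(s) {}
  const char *get_query_string() const { return sess->query.c_str(); }
  size_t get_query_length() const { return sess->query.length(); }
  unsigned long get_server_id() const { return sess->server_id; }
private:
  const session_internal *sess;   // never handed out to consumers
};
```

With this shape, the internal class can change freely as long as the accessors keep their meaning.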

>> API, eliminating lots of class definitions and accessor functions.
>> Though arguably it wouldn't really simplify the API, as the complexity
>> would just be in understanding the THD class.
>> For now, the API is proposed without exposing the THD class. (Similar
>> encapsulation could be added in actual implementation to also not
>> expose TABLE and similar classes).
> completely agree

Ok, so some follow up questions:

1. Do I understand correctly that you agree that the API should also
encapsulate TABLE and similar classes? These _are_ exposed to storage engines
as far as I can see.

2. If TABLE and so on should be encapsulated, there will be the issue of
having iterators to run over columns, etc. Do we already have standard classes
for this that could be used? Or should I do this modelled using the iterators
of the Standard C++ library, for example?

(I would like to make the new API fit in as well as possible with the existing
MySQL/MariaDB code, which you know much better).
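
To make question 2 concrete, here is a rough sketch of the second alternative: an encapsulated table with a column iterator modelled on the Standard C++ library. All names (table_internal, column_internal, rpl_table, rpl_column, column_iterator) are made up for this example; the two *_internal structs are stand-ins and not the real TABLE and Field classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for server-internal structures.
struct column_internal { std::string name; bool nullable; };
struct table_internal { std::string name; std::vector<column_internal> columns; };

// Encapsulated view of one column.
class rpl_column
{
public:
  explicit rpl_column(const column_internal *c) : col(c) {}
  const char *get_name() const { return col->name.c_str(); }
  bool is_nullable() const { return col->nullable; }
private:
  const column_internal *col;
};

// Encapsulated view of a table, with begin()/end() in the usual style.
class rpl_table
{
public:
  class column_iterator
  {
  public:
    column_iterator(const table_internal *t, size_t i) : tab(t), idx(i) {}
    rpl_column operator*() const { return rpl_column(&tab->columns[idx]); }
    column_iterator &operator++() { ++idx; return *this; }
    bool operator==(const column_iterator &o) const { return idx == o.idx; }
    bool operator!=(const column_iterator &o) const { return idx != o.idx; }
  private:
    const table_internal *tab;
    size_t idx;
  };

  explicit rpl_table(const table_internal *t) : tab(t) {}
  const char *get_name() const { return tab->name.c_str(); }
  column_iterator begin() const { return column_iterator(tab, 0); }
  column_iterator end() const { return column_iterator(tab, tab->columns.size()); }
private:
  const table_internal *tab;
};

// Example consumer code: it never touches the internal structures.
size_t count_nullable_columns(const rpl_table &tab)
{
  size_t n= 0;
  for (rpl_table::column_iterator it= tab.begin(); it != tab.end(); ++it)
    if ((*it).is_nullable())
      n++;
  return n;
}
```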

>> A consumer is implemented as a virtual class (interface). There is one virtual
>> function for every event that can be received. A consumer would derive from
> hm. This part I don't understand.
> How would that work ? A consumer wants to see a uniform stream of events,
> perhaps for sending them to a slave. Why would you need different
> consumers and different methods for different events ?
> I'd just have one method, receive_event(rpl_event_base *)

Ok, so do I understand you correctly that class rpl_event_base would have a
type field, and the consumer could then down-cast to the appropriate specific
event class based on the type?

  void receive_event(const rpl_event_base *generic_event)
  {
    switch (generic_event->type) {
    case rpl_event_base::RPL_EVENT_STATEMENT_QUERY: {
      const rpl_event_statement_query *ev=
        static_cast<const rpl_event_statement_query *>(generic_event);
      do_stuff(ev->get_query_string(), ...);
      break;
    }
    case rpl_event_base::RPL_EVENT_ROW_UPDATE: {
      const rpl_event_row_update *ev=
        static_cast<const rpl_event_row_update *>(generic_event);
      do_stuff(ev->get_after_image(), ...);
      break;
    }
    }
  }

I have always disliked having such a type field and downcasting. So I tried to
make an API where it was not needed. Like this:

  class my_event_consumer
    int stmt_query(const rpl_event_statement_query *ev)
      do_stuff(ev->get_query_string(), ...);
    int row_update(const rpl_event_row_update *ev)
      do_stuff(ev->get_after_image(), ...);

Maybe it was a stupid idea. I don't mind doing the simpler one with just a
receive_event() method and a type field.

(Actually, I think my dislike is mainly of class hierarchies which start out
with full abstraction, taking great care that everything type specific is
handled inside generic virtual methods of the base class. And then at some
point this gets tricky, and bits and pieces of outside code start to inspect
the type and do downcasts and type-specific stuff. And you end up with
something that has all the complexity of a polymorphic class hierarchy, but
none of the elegance. This is not the case here, as the events are just data
containers, they do not have complex logic attached).
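
For comparison, the two styles can be combined: a single receive_event() entry point taking a uniform stream of rpl_event_base pointers, with each event forwarding itself to the matching consumer method, so no type field or cast is needed. This is plain double dispatch (the visitor pattern), which nobody in this thread has proposed; it is shown only as a sketch. The names rpl_event_consumer and deliver(), and the string members standing in for real event data, are assumptions made for the example.

```cpp
#include <string>
#include <vector>

class rpl_event_statement_query;
class rpl_event_row_update;

// Consumer interface: one virtual function per event type.
class rpl_event_consumer
{
public:
  virtual ~rpl_event_consumer() {}
  virtual int stmt_query(const rpl_event_statement_query *ev) = 0;
  virtual int row_update(const rpl_event_row_update *ev) = 0;
};

class rpl_event_base
{
public:
  virtual ~rpl_event_base() {}
  // Each concrete event calls the consumer method that matches its own type.
  virtual int deliver(rpl_event_consumer *c) const = 0;
};

class rpl_event_statement_query : public rpl_event_base
{
public:
  explicit rpl_event_statement_query(const std::string &q) : query(q) {}
  const char *get_query_string() const { return query.c_str(); }
  int deliver(rpl_event_consumer *c) const { return c->stmt_query(this); }
private:
  std::string query;
};

class rpl_event_row_update : public rpl_event_base
{
public:
  explicit rpl_event_row_update(const std::string &img) : after_image(img) {}
  const char *get_after_image() const { return after_image.c_str(); }
  int deliver(rpl_event_consumer *c) const { return c->row_update(this); }
private:
  std::string after_image;
};

// Example consumer that only records what it received.
class my_event_consumer : public rpl_event_consumer
{
public:
  std::vector<std::string> log;
  int stmt_query(const rpl_event_statement_query *ev)
  { log.push_back(std::string("Q:") + ev->get_query_string()); return 0; }
  int row_update(const rpl_event_row_update *ev)
  { log.push_back(std::string("R:") + ev->get_after_image()); return 0; }
};

// Uniform entry point: accepts any event, no switch and no cast.
int receive_event(rpl_event_consumer *c, const rpl_event_base *ev)
{
  return ev->deliver(c);
}
```

The cost is that adding a new event type means touching the consumer interface, which is the same trade-off as in the one-method-per-event proposal.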

>>   /*
>>     The global transaction id is unique cross-server.
>>     It can be used to identify the position from which to start a slave
>>     replicating from a master.
>>     This global ID is only available once the transaction is decided to commit
>>     by the TC manager / primary redundancy service. This TC also allocates the
>>     ID and decides the exact semantics (can there be gaps, etc); however the
>>     format is fixed (cluster_id, running_counter).
> uhm. XID format is defined by the XA standard. An XID consists of
>  - format ID (unsigned long)
>  - global transaction ID - up to 64 bytes
>  - branch qualifier - up to 64 bytes
> as your transaction id is smaller, you will need to consider XID a part
> of the "context" - in cases where XID was generated externally.
> Same about binlog position - which is a "transaction id" in the MySQL
> replication. It doesn't fit into your scheme, so it will have to be a
> part of the context. And unless the redundancy service will be allowed
> to ignore your transaction ids, MySQL native replication will not fit
> into the API.

Yes, good points.

Ok, so my idea with the global transaction ID is following the previous
discussion, that there can be a primary redundancy plugin, and this gets to
control the commit order and create the global transaction IDs.

And the global transaction ID is used to allow slaves to easily synchronise to
any master. As long as a slave commits the last global transaction ID applied,
it can connect to any master and know where to start replicating (or determine
if the slave is actually ahead of the would-be master). Etc.

(I do not know if XID can be used for this purpose, but even if not your point
is still valid).

So maybe it is wrong to fix a particular global transaction ID format at this
level of API.

One option is to have only the local transaction ID at this level of API. Then
the primary redundancy plugin / TC manager should expose an API that allows
consumers (and others) to look up the global transaction ID from the local
transaction ID (I believe it will need to maintain such mapping anyway).

Another option is to expose a global transaction ID of generic format at this
layer (we could even use the XA standard XID format).
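
A small sketch of the first option, where only the local transaction ID is visible at this level and the primary redundancy plugin / TC manager answers lookups. Everything here is an assumption for illustration: the class name global_trx_id_map, the 64-bit field widths, and the rule used in slave_is_ahead() (same cluster, strictly increasing counter). As noted in the quoted comment, the exact semantics (gaps etc.) would really be decided by the TC.

```cpp
#include <stdint.h>
#include <map>

// (cluster_id, running_counter), following the format in the quoted comment.
struct rpl_global_trx_id
{
  uint64_t cluster_id;
  uint64_t running_counter;
};

// Hypothetical lookup service exposed by the TC manager.
class global_trx_id_map
{
public:
  explicit global_trx_id_map(uint64_t cluster)
    : cluster_id(cluster), next_counter(1) {}

  // Called once the transaction is decided to commit: allocate the next ID.
  rpl_global_trx_id allocate(uint64_t local_trx_id)
  {
    rpl_global_trx_id gid= { cluster_id, next_counter++ };
    ids[local_trx_id]= gid;
    return gid;
  }

  // Returns false if the local transaction has no global ID (yet).
  bool lookup(uint64_t local_trx_id, rpl_global_trx_id *out) const
  {
    std::map<uint64_t, rpl_global_trx_id>::const_iterator it=
      ids.find(local_trx_id);
    if (it == ids.end())
      return false;
    *out= it->second;
    return true;
  }

private:
  uint64_t cluster_id;
  uint64_t next_counter;
  std::map<uint64_t, rpl_global_trx_id> ids;
};

// Assumed rule: within one cluster, a larger counter means "later".
bool slave_is_ahead(const rpl_global_trx_id &slave_last,
                    const rpl_global_trx_id &master_last)
{
  return slave_last.cluster_id == master_last.cluster_id &&
         slave_last.running_counter > master_last.running_counter;
}
```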

>> class rpl_event_base
>> {
> ...
>>   int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
> ...
>>     Also, I still need to think about whether it is at all useful to be able
>>     to generically materialise an event at this level. It may be that any
>>     binlog/transport will in any case need to understand more of the format of
>>     events, so that such materialisation/transport is better done at a
>>     different layer.
> Right, I'm doubtful too.
> Say, to materialize a statement level event you need to know what
> exactly bits of the context you want to include. When replicating to
> MariaDB it's one set, when replicating to identically configured MariaDB
> of the same version it's another set, and when replicating to, say, DB2,
> it's probably a different (larger) set.

Yes, exactly. So that's the main reason I'd like to have a non-materialised
API, and then possibly build materialisation on top.

What I have been thinking is to have a default (but not mandatory) event
format. I am thinking to maybe use Google protocol buffers (they seem fairly
good for this purpose, and they are quite popular, eg. Monty is planning to
use them for dynamic columns). With such a format, it would be possible to
write generic plugins for a binlog implementation, direct transport to slave,
checksum/encrypt/compress etc. etc. Which I agree would be nice (and such
plugins don't really want to have to handle complete materialisation of any
possible event themselves from scratch).
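
To illustrate how generic plugins could sit on top of a materialised format, here is a sketch built around a writer callback like the one in the proposed materialise(). Several things are assumptions: the callback takes a const pointer and materialise() takes an extra context argument (the quoted signature has neither), and the byte layout (one type byte, a 4-byte little-endian length, then the query text) is invented for the example and is not protocol buffers or any real binlog format. The checksum is ordinary Adler-32.

```cpp
#include <stdint.h>
#include <cstddef>
#include <string>

typedef unsigned char uchar;
typedef int (*rpl_writer)(const uchar *data, size_t len, void *context);

class rpl_event_statement_query
{
public:
  explicit rpl_event_statement_query(const std::string &q) : query(q) {}

  // Invented layout: type byte (1), 4-byte little-endian length, query text.
  int materialise(rpl_writer writer, void *context) const
  {
    uchar header[5];
    uint32_t len= static_cast<uint32_t>(query.size());
    header[0]= 1;
    for (int i= 0; i < 4; i++)
      header[1 + i]= static_cast<uchar>((len >> (8 * i)) & 0xff);
    int err= writer(header, sizeof(header), context);
    if (err)
      return err;
    return writer(reinterpret_cast<const uchar *>(query.data()),
                  query.size(), context);
  }

private:
  std::string query;
};

// Generic writer 1: append the bytes to a std::string buffer.
int append_to_buffer(const uchar *data, size_t len, void *context)
{
  static_cast<std::string *>(context)->append(
    reinterpret_cast<const char *>(data), len);
  return 0;
}

// Generic writer 2: running Adler-32 over whatever bytes pass through.
// Initialise the state with a= 1, b= 0; the checksum is (b << 16) | a.
struct adler32_state { uint32_t a, b; };

int update_adler32(const uchar *data, size_t len, void *context)
{
  adler32_state *s= static_cast<adler32_state *>(context);
  for (size_t i= 0; i < len; i++)
  {
    s->a= (s->a + data[i]) % 65521;
    s->b= (s->b + s->a) % 65521;
  }
  return 0;
}
```

Neither writer knows anything about the event it is handling, which is the property that makes such plugins reusable.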

Incidentally, I think the existing binlog format is really hopeless to use
with such generic plugins, it seems intricately tied to a particular binlog
format (like including master binlog file names and file offsets inside of

>> One generator can be stacked on top of another. This means that a
>> generator on top (for example row-based events) will handle some
>> events itself (eg. non-deterministic update in mixed-mode binlogging).
>> Other events that it does not want to or cannot handle (for example
>> deterministic delete or DDL) will be deferred to the generator below
>> (for example statement-based events).
> There's a problem with this idea. Say, Event B is nested in Event A:
>    ... ... |<-    Event A ... .. .. ->| .. .. ..
>    *  *  *  *  * |<-   Event B  ->| *  *  *  *
> This is fine. But what about
>    ... ... |<-    Event A ... ->| .. .. ..
>    *  *  *  *  * |<-    Event B   ->| *  *  *  *
> In the latter case no event is nested in the other, and no level can
> simply defer to the other.
> I don't know a solution for this, I'm just hoping the above situation is
> impossible. At least, I could not find an example of "overlapping"
> events.

Another way of thinking about this is that we have one layer above handling
(or not handling) an event that can be generated below.

So if a statement is handled using row-based replication events, the row-based
replication event generator on top will choose to discard the corresponding
event from the statement-based generator below. If it is not handled, the
row-based generator will pass through the event from statement-based. (This is one
reason I wanted event generation to be very cheap (no materialisation); I
prefer this way of generating below and discarding above to having the layer
above set and clear flags (or whatever) for the layer below about whether to
generate events or not.)
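
The generate-below, discard-above idea can be sketched roughly as follows. The types statement_info and rpl_event, and the is_deterministic flag used to decide which layer handles a statement, are simplifications invented for the example; real generators would hand out lightweight event objects rather than copies of strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one executed statement.
struct statement_info
{
  std::string query;
  bool is_deterministic;
  std::vector<std::string> changed_rows;
};

struct rpl_event
{
  enum kind_t { STATEMENT, ROW };
  kind_t kind;
  std::string payload;
};

// Layer below: always generates its event; nothing is materialised.
class statement_generator
{
public:
  rpl_event generate(const statement_info &st) const
  {
    rpl_event ev= { rpl_event::STATEMENT, st.query };
    return ev;
  }
};

// Layer on top: either passes the event from below through, or discards it
// and emits its own row events instead.
class row_generator
{
public:
  explicit row_generator(const statement_generator *b) : below(b) {}

  std::vector<rpl_event> generate(const statement_info &st) const
  {
    std::vector<rpl_event> out;
    rpl_event from_below= below->generate(st);
    if (st.is_deterministic)
    {
      out.push_back(from_below);        // not handled here: pass through
      return out;
    }
    // Handled here: from_below is simply dropped.
    for (size_t i= 0; i < st.changed_rows.size(); i++)
    {
      rpl_event ev= { rpl_event::ROW, st.changed_rows[i] };
      out.push_back(ev);
    }
    return out;
  }

private:
  const statement_generator *below;
};
```

Note that the layer below needs no flags and no knowledge of the layer above.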

So one case where this becomes a problem is if we have a multi-table update
where one table is PBXT and another is not, and we are using PBXT engine-level
replication on top of statement-based replication. In this case, one half of
the statement-based event is handled by the layer above, but the other is
not. So we cannot deal with this situation.

(We could of course think of ways to handle this. For example, modify the
statement event to include a flag to only touch the non-PBXT tables when
applied on the slave. This would correspond to slicing up the events to make
them be nested properly in one another in the nested-event
description. Probably it is better just to not support such a scenario,
throwing an error.)
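
A possible shape for the "just do not support it" check, as a sketch only. The struct table_ref, its engine_replicates_itself flag (true for a table whose engine does its own engine-level replication, as PBXT would in the example above) and the enum values are all hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: one table touched by a statement.
struct table_ref
{
  std::string name;
  bool engine_replicates_itself;   // handled by the engine-level layer on top
};

enum layer_decision
{
  HANDLED_ABOVE,        // every table is covered by the layer on top
  PASS_THROUGH_BELOW,   // no table is; the statement-based event is used
  MIXED_UNSUPPORTED     // some are, some are not: report an error
};

layer_decision decide_layer(const std::vector<table_ref> &tables)
{
  size_t above= 0;
  for (size_t i= 0; i < tables.size(); i++)
    if (tables[i].engine_replicates_itself)
      above++;
  if (above == 0)
    return PASS_THROUGH_BELOW;
  if (above == tables.size())
    return HANDLED_ABOVE;
  return MIXED_UNSUPPORTED;
}
```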

>> I added in the proposed API a simple facility to materialise every
>> event as a string of bytes. To use this, I still need to add a
>> suitable facility to de-materialise the event.
> Couldn't that be done not in the API or generator, but as a filter
> somewhere up the chain ?

Yes. It's interesting that it could be a filter/generator higher in the stack,
I had not thought about that.

 - Kristian.
