maria-developers team mailing list archive
Re: Ideas for improving MariaDB/MySQL replication
On Jun 24, Kristian Nielsen wrote:
> High-Level Specification
> Generators and consumers
> We have the two concepts:
> 1. Event _generators_, that produce events describing all changes to
> data in a server.
> 2. Event consumers, that receive such events and use them in various
>    ways.
> An example of an event generator is the execution of SQL statements,
> which generates events like those used for statement-based replication.
> Another example is PBXT engine-level replication.
> An example of an event consumer is the writing of the binlog on a
> master.
> Some event generators are not really plugins. Rather, there are
> specific points in the server where events are generated. However, a
> generator can be part of a plugin, for example a PBXT engine-level
> replication event generator would be part of the PBXT storage engine
> plugin. And for example we could write a filter plugin, which would be
> stacked on top of an existing generator and provide the same event
> types and interfaces, but filtered in some way (for example by
> removing certain events on the master side, or by re-writing events in
> certain ways).
> Event consumers, on the other hand, could be plugins.
> One generator can be stacked on top of another. This means that a
> generator on top (for example row-based events) will handle some
> events itself (eg. non-deterministic update in mixed-mode binlogging).
> Other events that it does not want to or cannot handle (for example
> deterministic delete or DDL) will be deferred to the generator below
> (for example statement-based events).
There's a problem with this idea. Say, Event B is nested in Event A:
... ... |<- Event A ... .. .. ->| .. .. ..
* * * * * |<- Event B ->| * * * *
This is fine. But what about
... ... |<- Event A ... ->| .. .. ..
* * * * * |<- Event B ->| * * * *
In the latter case neither event is nested in the other, and no level
can simply defer to the other.
I don't know a solution for this; I'm just hoping the above situation is
impossible. At least, I could not find an example of "overlapping"
events.
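The stacking/deferral scheme quoted above could be sketched roughly as
follows. This is only an illustration of the idea, not actual server
code; the class names (RowEventGenerator, StmtEventGenerator) and the
string-typed Event are invented for the example.

```cpp
#include <string>

// A change to be replicated; "type" stands in for the real criteria
// (determinism, DDL vs DML, etc.) a generator would inspect.
struct Event { std::string type; std::string payload; };

class Generator {
public:
  virtual ~Generator() {}
  virtual Event generate(const Event &change) = 0;
};

// Bottom of the stack: statement-based events, handles everything.
class StmtEventGenerator : public Generator {
public:
  Event generate(const Event &change) override {
    return Event{"statement", change.payload};
  }
};

// Top of the stack: row-based events; handles non-deterministic
// changes itself, and defers the rest to the generator below
// (mixed-mode binlogging works along these lines).
class RowEventGenerator : public Generator {
  Generator *below;
public:
  explicit RowEventGenerator(Generator *b) : below(b) {}
  Event generate(const Event &change) override {
    if (change.type == "non-deterministic")
      return Event{"row", change.payload};   // handle it ourselves
    return below->generate(change);          // defer to the lower level
  }
};
```

The overlap problem described above is exactly the case this sketch
cannot express: generate() hands each change to one level or the other
in full, with no way to split a change across both.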
> Default materialisation format
> While the proposed API doesn't _require_ materialisation, we can still
> think about providing the _option_ for built-in materialisation. This
> could be useful if such materialisation is made suitable for transport
> to a different server (eg. no endian-dependence etc). If there is a
> facility for such materialisation built-in to the API, it becomes
> possible to write something like a generic binlog plugin or generic
> network transport plugin. This would be really useful for eg. PBXT
> engine-level replication, as it could be implemented without having to
> re-invent a binlog format.
> I added in the proposed API a simple facility to materialise every
> event as a string of bytes. To use this, I still need to add a
> suitable facility to de-materialise the event.
Couldn't that be done not in the API or generator, but as a filter
somewhere up the chain?
> So I think maybe it is better to add such a generic materialisation
> facility on top of the basic event generator API.
> Another fundamental question about the design is the level of
> encapsulation used for the API.
> At the implementation level, a lot of the work is basically to pull
> out all of the needed information from the THD object/context. The API
> I propose tries to _not_ expose the THD to consumers. Instead it
> provides accessor functions for all the bits and pieces relevant to
> each replication event, while the event class itself likely will be
> more or less just an encapsulated THD.
> So an alternative would be to have a generic event that was just
> (type, THD). Then consumers could just pull out whatever information
> they want from the THD. The THD implementation is already exposed to
> storage engines. This would of course greatly reduce the size of the
no, it's not. THD is not exposed to engines (unless they define
MYSQL_SERVER, but then it's not our problem); they use accessor
functions.
> API, eliminating lots of class definitions and accessor functions.
> Though arguably it wouldn't really simplify the API, as the complexity
> would just be in understanding the THD class.
> For now, the API is proposed without exposing the THD class. (Similar
> encapsulation could be added in actual implementation to also not
> expose TABLE and similar classes).
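The encapsulation being proposed above might look like this. The THD
struct here is a tiny stand-in for the real server class, and the event
class name and its accessors are invented for illustration; the point
is only that consumers see accessor functions, never the THD itself.

```cpp
#include <string>

struct THD {                       // stand-in for the server's THD
  std::string query;
  unsigned long thread_id;
};

// A replication event that is "more or less just an encapsulated THD":
// the context stays private, consumers get accessor functions for the
// bits relevant to this event type.
class rpl_event_query {
  const THD *thd;                  // private context, never exposed
public:
  explicit rpl_event_query(const THD *t) : thd(t) {}
  const std::string &query_string() const { return thd->query; }
  unsigned long connection_id() const { return thd->thread_id; }
};
```

The alternative discussed above, a generic (type, THD) pair, would
delete the accessor layer and push the burden of understanding THD onto
every consumer.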
> Low-Level Design
> A consumer is implemented as a virtual class (interface). There is one virtual
> function for every event that can be received. A consumer would derive from
hm. This part I don't understand.
How would that work? A consumer wants to see a uniform stream of events,
perhaps for sending them to a slave. Why would you need different
consumers and different methods for different events?
I'd just have one method, receive_event(rpl_event_base *)
> the base class and override methods for the events it wants to receive.
> There are methods for a consumer to register itself to receive events from
> each generator. I still need to find a way for a consumer in one plugin to
> register itself with a generator implemented in another plugin (eg. PBXT
> engine-level replication). I also need to add a way for consumers to
> de-register themselves.
Let's say that all generators are hard-coded and statically compiled in.
You can think about how to dynamically register them (e.g. pbxt) later.
> The current design has consumer callbacks return 0 for success and error code
> otherwise. I still need to think more about whether this is useful (ie. what
> is the semantics of returning an error from a consumer callback).
> Each event passed to consumers is defined as a class with public accessor
> methods to a private context (which is mostly the THD).
> My intention is to make all events passed around const, so that the same event
> can be passed to each of multiple registered consumers (and to emphasise that
> consumers do not have the ability to modify events). It still needs to be seen
> whether that const-ness will be feasible in practice without very heavy
> modification/constification of existing code.
> Virtual base class for generated replication events.
> This is the parent of events generated from all kinds of generators. Only
> child classes can be instantiated.
> This class can be used by code that wants to treat events in a generic way,
> without any knowledge of event details. I still need to decide whether such
> generic code is sensible.
sure it is. write event to binlog. send it to a slave. add a checksum,
encrypt, compress - all these consumers can treat an event as an opaque
stream of bytes.
> class rpl_event_base
> int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;
> Also, I still need to think about whether it is at all useful to be able
> to generically materialise an event at this level. It may be that any
> binlog/transport will in any case need to understand more of the format of
> events, so that such materialisation/transport is better done at a
> different layer.
Right, I'm doubtful too.
Say, to materialise a statement-level event you need to know which
exact bits of the context to include. When replicating to MariaDB it's
one set, when replicating to an identically configured MariaDB of the
same version it's another set, and when replicating to, say, DB2, it's
probably a different (larger) set.
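For concreteness, the writer-callback materialise() quoted above could
be used along these lines. The payload member and the extra context
argument to materialise() are assumptions added so the example is
self-contained (the quoted signature passes context only to the writer).

```cpp
#include <string>

typedef unsigned char uchar;

class rpl_event_base {
  std::string payload;             // pretend pre-serialised form
public:
  explicit rpl_event_base(const std::string &p) : payload(p) {}
  // Materialise the event as a string of bytes through a
  // caller-supplied writer; returns the writer's status (0 = success).
  int materialise(int (*writer)(uchar *data, size_t len, void *context),
                  void *context) const {
    // A real event would serialise its fields here; we just hand
    // over the bytes we already hold.
    return writer((uchar *)payload.data(), payload.size(), context);
  }
};

// Example writer: appends the bytes to a std::string buffer. A binlog
// or network-transport consumer would write to a file or socket instead.
static int append_writer(uchar *data, size_t len, void *context) {
  ((std::string *)context)->append((const char *)data, len);
  return 0;
}
```

Note the consumer here treats the bytes as opaque, which is exactly the
checksum/compress/encrypt use case mentioned earlier; the doubt above is
about consumers that need to understand the contents.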
> The global transaction id is unique cross-server.
> It can be used to identify the position from which to start a slave
> replicating from a master.
> This global ID is only available once the transaction is decided to commit
> by the TC manager / primary redundancy service. This TC also allocates the
> ID and decides the exact semantics (can there be gaps, etc); however the
> format is fixed (cluster_id, running_counter).
uhm. XID format is defined by the XA standard. An XID consists of
- format ID (unsigned long)
- global transaction ID - up to 64 bytes
- branch qualifier - up to 64 bytes
as your transaction id is smaller, you will need to consider XID a part
of the "context" - in cases where XID was generated externally.
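The XID layout described above, as fixed by the XA specification (and
mirrored by MySQL's own XID struct), looks like this; the xid_set()
helper is added here just for illustration.

```cpp
#include <cstring>

#define MAXGTRIDSIZE 64   // global transaction ID: up to 64 bytes
#define MAXBQUALSIZE 64   // branch qualifier: up to 64 bytes

struct xid_t {
  long formatID;                          // format ID (-1 = null XID)
  long gtrid_length;
  long bqual_length;
  char data[MAXGTRIDSIZE + MAXBQUALSIZE]; // gtrid followed by bqual
};

inline void xid_set(xid_t *x, long fmt, const char *gtrid, long glen,
                    const char *bqual, long blen) {
  x->formatID = fmt;
  x->gtrid_length = glen;
  x->bqual_length = blen;
  std::memcpy(x->data, gtrid, glen);
  std::memcpy(x->data + glen, bqual, blen);
}
```

A fixed (cluster_id, running_counter) pair clearly cannot hold an
externally generated 128-byte XID, which is the point being made: such
an XID has to travel in the "context" instead.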
Same about binlog position - which is a "transaction id" in the MySQL
replication. It doesn't fit into your scheme, so it will have to be a
part of the context. And unless the redundancy service will be allowed
to ignore your transaction ids, MySQL native replication will not fit
into the API.