maria-developers team mailing list archive

Re: Ideas for improving MariaDB/MySQL replication

To: maria-developers@xxxxxxxxxxxxxxxxxxx, Sergei Golubchik <serg@xxxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Thu, 24 Jun 2010 16:50:41 +0200
In-reply-to: <87k4psxz4e.fsf@knielsen-hq.org> (Kristian Nielsen's message of "Mon, 21 Jun 2010 16:02:25 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1 (gnu/linux)
Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> writes:

> 1. Event generators and consumers. This is what Sergei discussed. The
> essentials of this layer are hooks in handler::write_row() and similar places
> that provide data about changes (row values for row-based replication, query
> texts for statement-based replication, etc). There is no binlog or global
> transaction ID at this layer, I think there may not even be a defined event
> format as such, just an API for consumer plugins to get the information (and
> for generator plugins to provide it).

I started to write up some more concrete design for this part. Here is the
link to the worklog (text also included below):

    http://askmonty.org/worklog/Server-Sprint/?tid=120

This part describes the lowest layer with the generation of events, and also
has a little bit of discussion on some possible layers above (replicated event
stream and binlog/transport APIs).

This API aims to be useful for something like Tungsten to implement its
own binlog format. It might also be usable for something like Galera, or
alternatively Galera might want to hook into a higher layer providing default
replication stream and slave applier thread (I'm not sure which). Also this
API would be used to implement legacy MySQL binlog format for compatibility.

This is far from a final design, I plan to work much more on the
details. However, I think it is important to start discussing more concretely
the overall shape of the API.

There are several important overall decisions I made already, which I _did_
consider carefully, but which are still open to be discussed. And any feedback
in general would be most welcome. This project is a major addition to the
server with the potential to greatly influence the future direction of server
development (or not, if we get it wrong). And it's impossible for one person to
think of everything on his own.

So any feedback is welcome; meanwhile I'll continue expanding this and other
parts of the design.

 - Kristian.

-----------------------------------------------------------------------

High-Level Specification

Generators and consumers
------------------------

We have the two concepts:

1. Event _generators_, that produce events describing all changes to data in a
   server.

2. Event consumers, that receive such events and use them in various ways.

One example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.

An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.

Event consumers, on the other hand, could be plugins.

One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).
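
The stacking described above could be sketched roughly as follows. This is
only an illustration: the names `change`, `statement_generator` and
`row_generator` are hypothetical and do not appear in the worklog, and the
decision rule (row events only for non-deterministic, non-DDL changes) is an
assumed stand-in for mixed-mode logic.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one change made by the server.
struct change
{
  std::string query;     // statement text
  bool deterministic;    // safe to replicate as a statement?
  bool is_ddl;           // DDL is never row-logged in this sketch
};

// Lower generator: produces statement-based events.
struct statement_generator
{
  std::vector<std::string> fired;
  void generate(const change &c) { fired.push_back("STMT: " + c.query); }
};

// Upper generator: produces row-based events, stacked on the statement one.
struct row_generator
{
  statement_generator *below;
  std::vector<std::string> fired;

  explicit row_generator(statement_generator *b) : below(b)
  {
    assert(b != nullptr);
  }

  void generate(const change &c)
  {
    if (!c.is_ddl && !c.deterministic)
      fired.push_back("ROW: " + c.query);   // handled by the upper generator
    else
      below->generate(c);                   // deferred to the generator below
  }
};
```

Feeding a non-deterministic update, a deterministic delete and a DDL statement
through `row_generator::generate()` leaves one entry in the upper generator's
list and two in the lower one.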


Materialisation (or not)
------------------------

A central decision is how to represent events that are generated in the API at
the point of generation.

I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.

Using a non-materialised event means storing just a reference to appropriate
context that allows all information for the event to be retrieved using
accessors. Ie. typically this would be based on getting the event information
from the THD pointer.
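
To make the distinction concrete, here is a minimal sketch of the two styles.
The type `fake_thd` is a hypothetical stand-in for the real THD (which is far
larger), and both event classes are invented for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the server's THD.
struct fake_thd
{
  std::string query;
  std::string current_db;
  uint64_t local_trx_id;
};

// Materialised: every field is copied when the event is created, whether or
// not any consumer ever reads it.
struct materialised_query_event
{
  std::string query;
  std::string current_db;
  uint64_t local_trx_id;

  explicit materialised_query_event(const fake_thd &thd)
    : query(thd.query), current_db(thd.current_db),
      local_trx_id(thd.local_trx_id) { }
};

// Non-materialised: only a pointer to the context is stored; data is fetched
// on demand through accessors.
class lazy_query_event
{
public:
  explicit lazy_query_event(const fake_thd *thd) : thd_(thd)
  {
    assert(thd != nullptr);
  }
  const std::string &get_query_string() const { return thd_->query; }
  const std::string &get_current_db() const { return thd_->current_db; }
  uint64_t get_local_trx_id() const { return thd_->local_trx_id; }

private:
  const fake_thd *thd_;
};
```

The sketch also shows the cost of the non-materialised form: a
`lazy_query_event` reflects whatever the context holds at the time of the
accessor call, so it is only meaningful while that context is unchanged.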

Some reasons to avoid using materialised events in the API:

 - Replication events have a _lot_ of detailed context information that can be
   needed in events: user-defined variables, random seed, character sets,
   table column names and types, etc. etc. If we make the API based on
   materialisation, then the initial decision about which context information
   to include with which events will have to be done in the API, while ideally
   we want this decision to be done by the individual consumer plugin. There
   will thus be a conflict between what to include (to allow consumers access)
   and what to exclude (to avoid excessive needless work).

 - Materialising means defining a very specific format, which will tend to
   make the API less generic and flexible.

 - Unless the materialised format is made _very_ specific (and thus very
   inflexible), it is unlikely to be directly useful for transport
   (eg. binlog), so it will need to be re-materialised into a different format
   anyway, wasting work.

 - If a generator on top handles an event, then we want to avoid wasting work
   materialising an event in a generator below which would be completely
   unused. Thus there would be a need for the upper generator to somehow
   notify the lower generator ahead of event generation time to not fire an
   event, complicating the API.

Some advantages for materialisation:

 - Using an API based on passing around some well-defined struct event (or
   byte buffer) will be simpler than the complex class hierarchy proposed here
   with no requirement for materialisation.

 - Defining a materialised format would allow an easy way to use the same
   consumer code on a generator that produces events at the source of
   execution and on a generator that produces events from eg. reading them
   from an event log.

Note that there can be some middle way, where some data is materialised and
some is kept as reference to context (eg. THD) only. This however loses most
of the mentioned advantages for materialisation.

The design proposed here aims for as little materialisation as possible.


Default materialisation format
------------------------------


While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.

I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.
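
As an illustration of what "no endian-dependence" could mean in practice, the
sketch below materialises a (cluster_id, counter) pair into a fixed
little-endian byte layout and reads it back. The function names, the 12-byte
layout and the choice of little-endian are assumptions made for this example;
the proposal does not fix any of them. The -2 return for a too-small buffer
follows the convention used for materialise() in the draft API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct global_transaction_id
{
  uint32_t cluster_id;
  uint64_t counter;
};

// Assumed layout: 4 bytes cluster_id followed by 8 bytes counter.
static const size_t GTID_ENCODED_SIZE = 12;

// Returns 0 on success, -2 if the buffer is too small.
int materialise_gtid(const global_transaction_id &id,
                     unsigned char *buf, size_t buflen)
{
  if (buflen < GTID_ENCODED_SIZE)
    return -2;
  // Write least significant byte first, independent of host byte order.
  for (int i = 0; i < 4; i++)
    buf[i] = (unsigned char)(id.cluster_id >> (8 * i));
  for (int i = 0; i < 8; i++)
    buf[4 + i] = (unsigned char)(id.counter >> (8 * i));
  return 0;
}

// Returns 0 on success, -1 if the buffer is too short to hold an id.
int de_materialise_gtid(const unsigned char *buf, size_t buflen,
                        global_transaction_id *out)
{
  assert(out != nullptr);
  if (buflen < GTID_ENCODED_SIZE)
    return -1;
  out->cluster_id = 0;
  out->counter = 0;
  for (int i = 0; i < 4; i++)
    out->cluster_id |= (uint32_t)buf[i] << (8 * i);
  for (int i = 0; i < 8; i++)
    out->counter |= (uint64_t)buf[4 + i] << (8 * i);
  return 0;
}
```

Because the bytes are produced by shifts rather than by copying the in-memory
representation, the same buffer decodes to the same values on any host.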

However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.

So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.
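
A filter class placed between a generator and the materialisation layer could
look something like the sketch below. All names here (`query_event`,
`consumer`, `db_filter`, `collecting_consumer`) are hypothetical, and filtering
on database name is just one assumed example of what such a filter might do.
The `collecting_consumer` stands in for wherever a materialisation layer would
sit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified event and consumer interface.
struct query_event
{
  std::string db;
  std::string query;
};

struct consumer
{
  virtual ~consumer() { }
  virtual int stmt_query(const query_event *) { return 0; }
};

// Filter: is itself a consumer, and forwards only events for one database
// to the next consumer in the chain.
class db_filter : public consumer
{
public:
  db_filter(consumer *next, const std::string &db) : next_(next), db_(db)
  {
    assert(next != nullptr);
  }

  int stmt_query(const query_event *ev) override
  {
    if (ev->db != db_)
      return 0;                       // dropped, not an error
    return next_->stmt_query(ev);     // passed on unchanged
  }

private:
  consumer *next_;
  std::string db_;
};

// End of the chain; a materialisation layer would sit here.
struct collecting_consumer : public consumer
{
  std::vector<std::string> seen;
  int stmt_query(const query_event *ev) override
  {
    seen.push_back(ev->query);
    return 0;
  }
};
```

Since the filter only implements the consumer interface, neither the generator
nor the final consumer needs to know it is there.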

Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).


Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation
used for the API.

At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.

So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.

Note that we do not have to take any performance hit from using encapsulated
accessors since compilers can inline them (though if inlining then we do not
get any ABI stability with respect to THD implementation).

For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).

-----------------------------------------------------------------------

Low-Level Design

A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.

There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.
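
The effect of this interface inheritance can be shown with a reduced sketch.
The class names are shortened, hypothetical versions of the ones in the draft
API, and the two `fire_*` functions are assumed stand-ins for whatever the
generators do internally.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct statement_event { std::string query; };
struct row_event { std::string table; };

// Consumer interface for the lower (statement) generator.
struct consumer_statement
{
  virtual ~consumer_statement() { }
  virtual int stmt_query(const statement_event *) { return 0; }
};

// Consumer interface for the upper (row) generator inherits the lower one.
struct consumer_row : public consumer_statement
{
  virtual int row_write(const row_event *) { return 0; }
};

// A consumer registered with the row generator sees both kinds of events.
struct logging_consumer : public consumer_row
{
  std::vector<std::string> log;
  int stmt_query(const statement_event *ev) override
  {
    log.push_back("stmt " + ev->query);
    return 0;
  }
  int row_write(const row_event *ev) override
  {
    log.push_back("row " + ev->table);
    return 0;
  }
};

// What the row generator does for an event it handles itself.
int fire_row(consumer_row *c, const row_event *ev)
{
  assert(c != nullptr && ev != nullptr);
  return c->row_write(ev);
}

// What the statement generator does for an event deferred to it. It only
// needs the base interface, which every row consumer also provides.
int fire_statement(consumer_statement *c, const statement_event *ev)
{
  assert(c != nullptr && ev != nullptr);
  return c->stmt_query(ev);
}
```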

There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.
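
One simple shape that registration and de-registration could take is a
per-generator list of consumer pointers. The `consumer_registry` class below
is purely hypothetical, and it ignores locking entirely; a real server
implementation would presumably need to protect the list against concurrent
registration and event dispatch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, minimal consumer interface.
struct consumer
{
  virtual ~consumer() { }
  virtual int trx_commit() { return 0; }
};

class consumer_registry
{
public:
  // Returns 0 on success, -1 if the consumer is already registered.
  int add(consumer *c)
  {
    assert(c != nullptr);
    if (is_registered(c))
      return -1;
    consumers_.push_back(c);
    return 0;
  }

  // Returns 0 on success, -1 if the consumer was not registered.
  int remove(consumer *c)
  {
    std::vector<consumer *>::iterator it=
      std::find(consumers_.begin(), consumers_.end(), c);
    if (it == consumers_.end())
      return -1;
    consumers_.erase(it);
    return 0;
  }

  bool is_registered(const consumer *c) const
  {
    return std::find(consumers_.begin(), consumers_.end(), c) !=
           consumers_.end();
  }

  size_t size() const { return consumers_.size(); }

private:
  std::vector<consumer *> consumers_;
};
```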

The current design has consumer callbacks return 0 for success and error code
otherwise. I still need to think more about whether this is useful (ie. what
is the semantics of returning an error from a consumer callback).

Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).

My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It still needs to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.
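
The sketch below shows one const event being delivered to several consumers,
together with one possible answer to the error-code question: deliver to
everyone and report the first non-zero code. That policy is an assumption
picked for the example, not something the design has settled on, and all
names are hypothetical.

```cpp
#include <cassert>
#include <vector>

class commit_event
{
public:
  explicit commit_event(unsigned long long id) : trx_id_(id) { }
  unsigned long long get_local_trx_id() const { return trx_id_; }

private:
  unsigned long long trx_id_;
};

struct consumer
{
  virtual ~consumer() { }
  virtual int trx_commit(const commit_event *) { return 0; }
};

struct counting_consumer : public consumer
{
  int seen;
  counting_consumer() : seen(0) { }
  int trx_commit(const commit_event *) override { seen++; return 0; }
};

struct failing_consumer : public consumer
{
  int trx_commit(const commit_event *) override { return 5; }
};

// Assumed policy: every consumer gets the event; the first error code is
// remembered and returned, later consumers are still called.
int fire_commit(const std::vector<consumer *> &consumers,
                const commit_event &ev)
{
  int first_error= 0;
  for (consumer *c : consumers)
  {
    assert(c != nullptr);
    int err= c->trx_commit(&ev);    // same const event for each consumer
    if (err != 0 && first_error == 0)
      first_error= err;
  }
  return first_error;
}
```

An alternative policy would stop at the first error; which of the two is
right depends on what an error from a consumer is supposed to mean.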

What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.

-----------------------------------------------------------------------

/*
  Virtual base class for generated replication events.

  This is the parent of events generated from all kinds of generators. Only
  child classes can be instantiated.

  This class can be used by code that wants to treat events in a generic way,
  without any knowledge of event details. I still need to decide whether such
  generic code is sensible.
*/
class rpl_event_base
{
public:
  /*
    Maybe we will want the ability to materialise an event to a standard
    binary format. This could be achieved with a base method like this. The
    actual materialisation would be implemented in each deriving class. The
    public methods would provide different interfaces for specifying the
    buffer or for writing directly into IO_CACHE or file.
  */

  /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
  int materialise(uchar *buffer, size_t buflen) const;
  /*
    Returns NULL on error or else malloc()ed buffer with materialised event,
    caller must free().
  */
  uchar *materialise() const;
  /* Same but using passed in memroot. */
  uchar *materialise(mem_root *memroot) const;
  /*
    Materialise to user-supplied writer function (could write directly to file
    or the like).
  */
  int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;

  /*
    As for what to do with a materialised event, there are a couple of
    possibilities.

    One is to have a de_materialise() method somewhere that can construct an
    rpl_event_base (really a derived class of course) from a buffer or writer
    function. This would require each accessor function to conditionally read
    its data from either THD context or buffer (GCC is able to optimise
    several such conditionals in multiple accessor function calls into one
    conditional), or we can make all accessors virtual if the performance hit
    is acceptable.

    Another is to have different classes for accessing events read from
    materialised event data.

    Also, I still need to think about whether it is at all useful to be able
    to generically materialise an event at this level. It may be that any
    binlog/transport will in any case need to understand more of the format of
    events, so that such materialisation/transport is better done at a
    different layer.
  */

protected:
  /* Implementation which is the basis for materialise(). */
  virtual int do_materialise(int (*writer)(uchar *data, size_t len,
                                           void *context)) const = 0;

  /*
    Virtual base class; the constructor is protected so that only deriving
    classes can be instantiated.
  */
  rpl_event_base();
};


/*
  These are the event types output from the transaction event generator.

  This generator is not stacked on anything.

  The transaction event generator marks the start and end (commit or rollback)
  of transactions. It also gives information about whether the transaction was
  a full transaction or autocommitted statement, whether transactional tables
  were involved, whether non-transactional tables were involved, and XA
  information (ToDo).
*/

/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
  /*
    Get the local transaction id. This id is only unique within one server.
    It is allocated whenever a new transaction is started.
    Can be used to identify events belonging to the same transaction in a
    binlog-like stream of events streamed in parallel among multiple
    transactions.
  */
  uint64_t get_local_trx_id() const { return thd->local_trx_id; }

  bool get_is_autocommit() const;

protected:
  /* Protected so that the deriving event classes can construct the base. */
  rpl_event_transaction_base(THD *_thd) : thd(_thd) { }

private:
  /* The context is the THD. */
  THD *thd;
};

/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{

};

/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
  /*
    The global transaction id is unique cross-server.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID is only available once the transaction is decided to commit
    by the TC manager / primary redundancy service. This TC also allocates the
    ID and decides the exact semantics (can there be gaps, etc); however the
    format is fixed (cluster_id, running_counter).
  */
  struct global_transaction_id
  {
    uint32_t cluster_id;
    uint64_t counter;
  };

  const global_transaction_id *get_global_transaction_id() const;
};

/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{

};


/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
  LEX_STRING get_current_db() const;
};

class rpl_event_statement_start : public rpl_event_statement_base
{

};

class rpl_event_statement_end : public rpl_event_statement_base
{
public:
  int get_errorcode() const;
};

class rpl_event_statement_query : public rpl_event_statement_base
{
public:
  /* Accessors are const so they can be called on the const events passed
     to consumers. */
  LEX_STRING get_query_string() const;
  ulong get_sql_mode() const;
  const CHARSET_INFO *get_character_set_client() const;
  const CHARSET_INFO *get_collation_connection() const;
  const CHARSET_INFO *get_collation_server() const;
  const CHARSET_INFO *get_collation_default_db() const;

  /*
    Access to relevant flags that affect query execution.

    Use as if (ev->get_flags() & (uint32_t)STMT_FOREIGN_KEY_CHECKS) { ... }
  */
  enum flag_bits
  {
    /* Each flag is a distinct bit so flags can be tested with '&'. */
    STMT_FOREIGN_KEY_CHECKS = 1 << 0,            // @@foreign_key_checks
    STMT_UNIQUE_KEY_CHECKS  = 1 << 1,            // @@unique_checks
    STMT_AUTO_IS_NULL       = 1 << 2             // @@sql_auto_is_null
  };
  uint32_t get_flags() const;

  ulong get_auto_increment_offset() const;
  ulong get_auto_increment_increment() const;

  // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
  // INSERT_ID; random seed; user variables.
  //
  // We probably also need get_uses_temporary_table(), get_used_user_vars(),
  // get_uses_auto_increment() and so on, so a consumer can get more
  // information about what kind of context information a query will need when
  // executed on a slave.
};

class rpl_event_statement_load_query : public rpl_event_statement_query
{

};

/*
  This event is fired with blocks of data for files read (from server-local
  file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
  struct block
  {
    const uchar *ptr;
    size_t size;
  };
  block get_block() const;
};

/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
  /*
    Access to relevant handler extra flags and other flags that affect row
    operations.

    Use as if (ev->get_flags() & (uint32_t)ROW_WRITE_CAN_REPLACE) { ... }
  */
  enum flag_bits
  {
    /* Each flag is a distinct bit so flags can be tested with '&'. */
    ROW_WRITE_CAN_REPLACE          = 1 << 0,    // HA_EXTRA_WRITE_CAN_REPLACE
    ROW_IGNORE_DUP_KEY             = 1 << 1,    // HA_EXTRA_IGNORE_DUP_KEY
    ROW_IGNORE_NO_KEY              = 1 << 2,    // HA_EXTRA_IGNORE_NO_KEY
    ROW_DISABLE_FOREIGN_KEY_CHECKS = 1 << 3,    // ! @@foreign_key_checks
    ROW_DISABLE_UNIQUE_KEY_CHECKS  = 1 << 4     // ! @@unique_checks
  };
  uint32_t get_flags() const;

  /* Access to list of tables modified. */
  class table_iterator
  {
  public:
    /* Returns table, NULL after last. */
    const TABLE *get_next();
  private:
    // ...
  };
  table_iterator get_modified_tables() const;

private:
  /* Context used to provide accessors. */
  THD *thd;

protected:
  rpl_event_row_base(THD *_thd) : thd(_thd) { }
};


class rpl_event_row_write : public rpl_event_row_base
{
public:
  const BITMAP *get_write_set() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_update : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const BITMAP *get_write_set() const;
  const uchar *get_before_image() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_delete : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const uchar *get_before_image() const;
};


/*
  Event consumer callbacks.

  An event consumer registers with an event generator to receive event
  notifications from that generator.

  The consumer has callbacks (in the form of virtual functions) for the
  individual event types the consumer is interested in. Callbacks that the
  consumer does not override keep their default no-op implementation. If an
  event applies to multiple callbacks in a single consumer, it will only be
  passed to the most specific overridden callback (so events never fire more
  than once per registration).

  The lifetime of the memory holding the event is only for the duration of the
  callback invocation, unless otherwise noted.

  Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/

struct rpl_event_consumer_transaction
{
  virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
  virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
  virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};

/*
  Consuming statement-based events.

  The statement event generator is stacked on top of the transaction event
  generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
  virtual int stmt_end(const rpl_event_statement_end *) { return 0; }

  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }

  /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
  virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
    { return 0; }

  /*
    These are specific kinds of statements; if specified they override
    stmt_query() for the corresponding event.
  */
  virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
    { return stmt_query(ev); }
};

/*
  Consuming row-based events.

  The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
  virtual int row_update(const rpl_event_row_update *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};


/*
  Registration functions.

  ToDo: Make a way to de-register.

  ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
  registration method.
*/

/* Consumers are passed non-const, since the callbacks are non-const methods. */
int rpl_event_transaction_register(rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(rpl_event_consumer_statement *cbs);
int rpl_event_row_register(rpl_event_consumer_row *cbs);
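
As a rough illustration of how the public materialise() overloads in
rpl_event_base could all be layered on the single do_materialise() hook, here
is a reduced, self-contained sketch. It is not the proposed API: the class
names `event_base` and `query_event` are hypothetical, the writer takes const
data, and the writer-based overload is given an explicit `context` argument,
which the draft declaration above does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

typedef unsigned char uchar;
typedef int (*writer_fn)(const uchar *data, size_t len, void *context);

class event_base
{
private:
  /* State for the buffer-based writer. */
  struct buffer_ctx { uchar *buf; size_t cap; size_t used; bool overflow; };

  static int buffer_writer(const uchar *data, size_t len, void *context)
  {
    buffer_ctx *ctx= static_cast<buffer_ctx *>(context);
    if (len > ctx->cap - ctx->used)
    {
      ctx->overflow= true;
      return -1;
    }
    std::memcpy(ctx->buf + ctx->used, data, len);
    ctx->used+= len;
    return 0;
  }

public:
  virtual ~event_base() { }

  /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
  int materialise(uchar *buffer, size_t buflen) const
  {
    buffer_ctx ctx= { buffer, buflen, 0, false };
    int err= do_materialise(&buffer_writer, &ctx);
    if (ctx.overflow)
      return -2;
    return err ? -1 : 0;
  }

  /* Materialise through a user-supplied writer function. */
  int materialise(writer_fn writer, void *context) const
  {
    return do_materialise(writer, context);
  }

protected:
  /* Each deriving class implements only this. */
  virtual int do_materialise(writer_fn writer, void *context) const = 0;
};

/* Hypothetical event whose materialised form is just its query text. */
class query_event : public event_base
{
public:
  explicit query_event(const std::string &q) : query_(q) { }

protected:
  int do_materialise(writer_fn writer, void *context) const override
  {
    return writer(reinterpret_cast<const uchar *>(query_.data()),
                  query_.size(), context);
  }

private:
  std::string query_;
};

/* Example writer: appends to a std::string passed as context. */
int append_to_string(const uchar *data, size_t len, void *context)
{
  assert(context != nullptr);
  static_cast<std::string *>(context)->append(
    reinterpret_cast<const char *>(data), len);
  return 0;
}
```

With this layering a deriving class writes its serialisation logic once, and
buffer, file or socket targets differ only in the writer function supplied.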