maria-developers team mailing list archive

Updated (by Knielsen): Replication API for stacked event generators (120)

 

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: Knielsen
COPIES TO......: 
CATEGORY.......: Server-Sprint
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Category updated.
--- /tmp/wklog.120.old.7440     2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440     2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Server-RawIdeaBin
+Server-Sprint

-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Status updated.
--- /tmp/wklog.120.old.7440     2010-06-24 14:29:38.000000000 +0000
+++ /tmp/wklog.120.new.7440     2010-06-24 14:29:38.000000000 +0000
@@ -1 +1 @@
-Un-Assigned
+Assigned

-=-=(Knielsen - Thu, 24 Jun 2010, 14:29)=-=-
Low Level Design modified.

-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.

-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120

-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516      2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516      2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
 
 Event generators can be stacked, and a generator may defer event generation to
 the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.

-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.

Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).



DESCRIPTION:

A part of the replication project, MWL#107.

Events are produced by event Generators. Examples are

 - Generation of statement-based replication events
 - Generation of row-based events
 - Generation of PBXT engine-level replication events

Reading of events from the relay log on a slave may also be an example of
generating events.

Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.


HIGH-LEVEL SPECIFICATION:



Generators and consumers
------------------------

We have the two concepts:

1. Event _generators_, that produce events describing all changes to data in a
   server.

2. Event consumers, that receive such events and use them in various ways.

An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.

An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.

Event consumers, on the other hand, could be plugins.

One generator can be stacked on top of another. This means that a generator on
top (for example row-based events) will handle some events itself
(eg. non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example deterministic delete or DDL)
will be deferred to the generator below (for example statement-based events).


Materialisation (or not)
------------------------

A central decision is how the API represents events at the point where they
are generated.

I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.

Using a non-materialised event means storing just a reference to the
appropriate context, which allows all information for the event to be
retrieved using accessors. Ie. typically this would be based on getting the
event information from the THD pointer.
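
As a purely illustrative sketch (not part of the proposal; the class name and
the exact THD members used are assumptions), a non-materialised accessor would
simply read from the context on demand instead of copying data up front:

/*
  Hypothetical illustration of a non-materialised accessor: the event object
  holds only a THD pointer and computes the answer when asked.
*/
class rpl_event_query_sketch
{
public:
  /* Read the SQL mode directly from the THD context (assumed layout). */
  ulong get_sql_mode() const { return thd->variables.sql_mode; }
private:
  THD *thd;
};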

Some reasons to avoid using materialised events in the API:

 - Replication events have a _lot_ of detailed context information that can be
   needed in events: user-defined variables, random seed, character sets,
   table column names and types, etc. etc. If we make the API based on
   materialisation, then the initial decision about which context information
   to include with which events will have to be done in the API, while ideally
   we want this decision to be done by the individual consumer plugin. There
   will thus be a conflict between what to include (to allow consumers access)
   and what to exclude (to avoid excessive needless work).

 - Materialising means defining a very specific format, which will tend to
   make the API less generic and flexible.

 - Unless the materialised format is made _very_ specific (and thus very
   inflexible), it is unlikely to be directly useful for transport
   (eg. binlog), so it will need to be re-materialised into a different format
   anyway, wasting work.

 - If a generator on top handles an event, then we want to avoid wasting work
   materialising an event in a generator below which would be completely
   unused. Thus there would be a need for the upper generator to somehow
   notify the lower generator ahead of event generation time to not fire an
   event, complicating the API.

Some advantages for materialisation:

 - Using an API based on passing around some well-defined struct event (or
   byte buffer) will be simpler than the complex class hierarchy proposed here
   with no requirement for materialisation.

 - Defining a materialised format would allow an easy way to use the same
   consumer code on a generator that produces events at the source of
   execution and on a generator that produces events from eg. reading them
   from an event log.

Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.

The design proposed here aims for as little materialisation as possible.


Default materialisation format
------------------------------


While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.

I added in the proposed API a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.

However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.

So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave. 

Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like legacy
MySQL 5.1 binlog event format and maybe Tungsten would have its own format).


Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation
used for the API.

At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.

So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.

Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though with inlining we do not
get any ABI stability with respect to the THD implementation).

For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).


LOW-LEVEL DESIGN:



A consumer is implemented as a virtual class (interface). There is one virtual
function for every event that can be received. A consumer would derive from
the base class and override methods for the events it wants to receive.

There is one consumer interface for each generator. When a generator A is
stacked on B, the consumer interface for A inherits from the interface for
B. This way, when A defers an event to B, the consumer for A will receive the
corresponding event from B.

There are methods for a consumer to register itself to receive events from
each generator. I still need to find a way for a consumer in one plugin to
register itself with a generator implemented in another plugin (eg. PBXT
engine-level replication). I also need to add a way for consumers to
de-register themselves.

The current design has consumer callbacks return 0 for success and an error
code otherwise. I still need to think more about whether this is useful
(ie. what the semantics of returning an error from a consumer callback are).

Each event passed to consumers is defined as a class with public accessor
methods to a private context (which is mostly the THD).

My intention is to make all events passed around const, so that the same event
can be passed to each of multiple registered consumers (and to emphasise that
consumers do not have the ability to modify events). It remains to be seen
whether that const-ness will be feasible in practice without very heavy
modification/constification of existing code.

What follows is a partial draft of a possible definition of the API as
concrete C++ class definitions.

-----------------------------------------------------------------------

/*
  Virtual base class for generated replication events.

  This is the parent of events generated from all kinds of generators. Only
  child classes can be instantiated.

  This class can be used by code that wants to treat events in a generic way,
  without any knowledge of event details. I still need to decide whether such
  generic code is sensible.
*/
class rpl_event_base
{
  /*
    Maybe we will want the ability to materialise an event to a standard
    binary format. This could be achieved with a base method like this. The
    actual materialisation would be implemented in each deriving class. The
    public methods would provide different interfaces for specifying the
    buffer or for writing directly into IO_CACHE or file.
  */

public:
  /* Return 0 on success, -1 on error, -2 on out-of-buffer. */
  int materialise(uchar *buffer, size_t buflen) const;
  /*
    Returns NULL on error or else malloc()ed buffer with materialised event,
    caller must free().
  */
  uchar *materialise() const;
  /* Same, but using a passed-in memroot. */
  uchar *materialise(mem_root *memroot) const;
  /*
    Materialise to user-supplied writer function (could write directly to file
    or the like).
  */
  int materialise(int (*writer)(uchar *data, size_t len, void *context)) const;

  /*
    As to for what to do with a materialised event, there are a couple of
    possibilities.

    One is to have a de_materialise() method somewhere that can construct an
    rpl_event_base (really a derived class of course) from a buffer or writer
    function. This would require each accessor function to conditionally read
    its data from either THD context or buffer (GCC is able to optimise
    several such conditionals in multiple accessor function calls into one
    conditional), or we can make all accessors virtual if the performance hit
    is acceptable.

    Another is to have different classes for accessing events read from
    materialised event data.

    Also, I still need to think about whether it is at all useful to be able
    to generically materialise an event at this level. It may be that any
    binlog/transport will in any case need to understand more of the format of
    events, so that such materialisation/transport is better done at a
    different layer.
  */

protected:
  /* Implementation which is the basis for materialise(). */
  virtual int do_materialise(int (*writer)(uchar *data, size_t len,
                                           void *context)) const = 0;

private:
  /* Virtual base class, private constructor to prevent instantiation. */
  rpl_event_base();
};
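
For illustration, here is a hypothetical usage sketch (the helper function is
made up and not part of the proposed API) of the buffer-based materialise()
variants, relying only on the return-code convention documented above
(0 on success, -1 on error, -2 on out-of-buffer):

/* Hypothetical helper: try a stack buffer first, fall back to malloc(). */
static int materialise_with_retry(const rpl_event_base *ev)
{
  uchar small_buf[1024];
  int res= ev->materialise(small_buf, sizeof(small_buf));
  if (res == -2)
  {
    /* Buffer too small; use the malloc()ing variant instead. */
    uchar *big= ev->materialise();
    if (!big)
      return -1;
    /* ... use the materialised bytes here ... */
    free(big);
    return 0;
  }
  return res;
}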


/*
  These are the event types output from the transaction event generator.

  This generator is not stacked on anything.

  The transaction event generator marks the start and end (commit or rollback)
  of transactions. It also gives information about whether the transaction was
  a full transaction or autocommitted statement, whether transactional tables
  were involved, whether non-transactional tables were involved, and XA
  information (ToDo).
*/

/* Base class for transaction events. */
class rpl_event_transaction_base : public rpl_event_base
{
public:
  /*
    Get the local transaction id. This id is only unique within one server.
    It is allocated whenever a new transaction is started.
    Can be used to identify events belonging to the same transaction in a
    binlog-like stream of events streamed in parallel among multiple
    transactions.
  */
  uint64_t get_local_trx_id() const { return thd->local_trx_id; };

  bool get_is_autocommit() const;

private:
  /* The context is the THD. */
  THD *thd;

  rpl_event_transaction_base(THD *_thd) : thd(_thd) { };
};

/* Transaction start event. */
class rpl_event_transaction_start : public rpl_event_transaction_base
{

};

/* Transaction commit. */
class rpl_event_transaction_commit : public rpl_event_transaction_base
{
public:
  /*
    The global transaction id is unique cross-server.

    It can be used to identify the position from which to start a slave
    replicating from a master.

    This global ID is only available once the transaction is decided to commit
    by the TC manager / primary redundancy service. This TC also allocates the
    ID and decides the exact semantics (can there be gaps, etc); however the
    format is fixed (cluster_id, running_counter).
  */
  struct global_transaction_id
  {
    uint32_t cluster_id;
    uint64_t counter;
  };

  const global_transaction_id *get_global_transaction_id() const;
};

/* Transaction rollback. */
class rpl_event_transaction_rollback : public rpl_event_transaction_base
{

};


/* Base class for statement events. */
class rpl_event_statement_base : public rpl_event_base
{
public:
  LEX_STRING get_current_db() const;
};

class rpl_event_statement_start : public rpl_event_statement_base
{

};

class rpl_event_statement_end : public rpl_event_statement_base
{
public:
  int get_errorcode() const;
};

class rpl_event_statement_query : public rpl_event_statement_base
{
public:
  LEX_STRING get_query_string() const;
  ulong get_sql_mode() const;
  const CHARSET_INFO *get_character_set_client() const;
  const CHARSET_INFO *get_collation_connection() const;
  const CHARSET_INFO *get_collation_server() const;
  const CHARSET_INFO *get_collation_default_db() const;

  /*
    Access to relevant flags that affect query execution.

    Use as: if (ev->get_flags() & (uint32)STMT_FOREIGN_KEY_CHECKS) { ... }
  */
  enum flag_bits
  {
    STMT_FOREIGN_KEY_CHECKS = 1 << 0,            // @@foreign_key_checks
    STMT_UNIQUE_KEY_CHECKS  = 1 << 1,            // @@unique_checks
    STMT_AUTO_IS_NULL       = 1 << 2,            // @@sql_auto_is_null
  };
  uint32_t get_flags() const;

  ulong get_auto_increment_offset() const;
  ulong get_auto_increment_increment() const;

  // And so on for: time zone; day/month names; connection id; LAST_INSERT_ID;
  // INSERT_ID; random seed; user variables.
  //
  // We probably also need get_uses_temporary_table(), get_used_user_vars(),
  // get_uses_auto_increment() and so on, so a consumer can get more
  // information about what kind of context information a query will need when
  // executed on a slave.
};

class rpl_event_statement_load_query : public rpl_event_statement_query
{

};

/*
  This event is fired with blocks of data for files read (from server-local
  file or client connection) for LOAD DATA.
*/
class rpl_event_statement_load_data_block : public rpl_event_statement_base
{
public:
  struct block
  {
    const uchar *ptr;
    size_t size;
  };
  block get_block() const;
};

/* Base class for row-based replication events. */
class rpl_event_row_base : public rpl_event_base
{
public:
  /*
    Access to relevant handler extra flags and other flags that affect row
    operations.

    Use as: if (ev->get_flags() & (uint32)ROW_WRITE_CAN_REPLACE) { ... }
  */
  enum flag_bits
  {
    ROW_WRITE_CAN_REPLACE          = 1 << 0,    // HA_EXTRA_WRITE_CAN_REPLACE
    ROW_IGNORE_DUP_KEY             = 1 << 1,    // HA_EXTRA_IGNORE_DUP_KEY
    ROW_IGNORE_NO_KEY              = 1 << 2,    // HA_EXTRA_IGNORE_NO_KEY
    ROW_DISABLE_FOREIGN_KEY_CHECKS = 1 << 3,    // ! @@foreign_key_checks
    ROW_DISABLE_UNIQUE_KEY_CHECKS  = 1 << 4,    // ! @@unique_checks
  };
  uint32_t get_flags() const;

  /* Access to list of tables modified. */
  class table_iterator
  {
  public:
    /* Returns table, NULL after last. */
    const TABLE *get_next();
  private:
    // ...
  };
  table_iterator get_modified_tables() const;

private:
  /* Context used to provide accessors. */
  THD *thd;

protected:
  rpl_event_row_base(THD *_thd) : thd(_thd) { }
};
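
As a hypothetical illustration (the function below is made up and not part of
the API), a consumer could walk the tables modified by a row event using the
iterator declared above, stopping when get_next() returns NULL:

/* Hypothetical sketch: enumerate the tables modified by a row event. */
static void list_modified_tables(const rpl_event_row_base *ev)
{
  rpl_event_row_base::table_iterator it= ev->get_modified_tables();
  while (const TABLE *table= it.get_next())
  {
    /* Inspect table metadata here, eg. the table name in table->s. */
  }
}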


class rpl_event_row_write : public rpl_event_row_base
{
public:
  const BITMAP *get_write_set() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_update : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const BITMAP *get_write_set() const;
  const uchar *get_before_image() const;
  const uchar *get_after_image() const;
};

class rpl_event_row_delete : public rpl_event_row_base
{
public:
  const BITMAP *get_read_set() const;
  const uchar *get_before_image() const;
};


/*
  Event consumer callbacks.

  An event consumer registers with an event generator to receive event
  notifications from that generator.

  The consumer has callbacks (in the form of virtual functions) for the
  individual event types the consumer is interested in. Only callbacks that
  are overridden will do anything; the default implementations are no-ops. If
  an event applies to multiple callbacks in a single callback struct, it will
  only be passed to the most specific overridden callback (so events never
  fire more than once per registration).

  The lifetime of the memory holding the event is only for the duration of the
  callback invocation, unless otherwise noted.

  Callbacks return 0 for success or error code (ToDo: does this make sense?).
*/

struct rpl_event_consumer_transaction
{
  virtual int trx_start(const rpl_event_transaction_start *) { return 0; }
  virtual int trx_commit(const rpl_event_transaction_commit *) { return 0; }
  virtual int trx_rollback(const rpl_event_transaction_rollback *) { return 0; }
};

/*
  Consuming statement-based events.

  The statement event generator is stacked on top of the transaction event
  generator, so we can receive those events as well.
*/
struct rpl_event_consumer_statement : public rpl_event_consumer_transaction
{
  virtual int stmt_start(const rpl_event_statement_start *) { return 0; }
  virtual int stmt_end(const rpl_event_statement_end *) { return 0; }

  virtual int stmt_query(const rpl_event_statement_query *) { return 0; }

  /* Data for a file used in LOAD DATA [LOCAL] INFILE. */
  virtual int stmt_load_data_block(const rpl_event_statement_load_data_block *)
    { return 0; }

  /*
    These are specific kinds of statements; if overridden they replace
    stmt_query() for the corresponding event.
  */
  virtual int stmt_load_query(const rpl_event_statement_load_query *ev)
    { return stmt_query(ev); }
};

/*
  Consuming row-based events.

  The row event generator is stacked on top of the statement event generator.
*/
struct rpl_event_consumer_row : public rpl_event_consumer_statement
{
  virtual int row_write(const rpl_event_row_write *) { return 0; }
  virtual int row_update(const rpl_event_row_update *) { return 0; }
  virtual int row_delete(const rpl_event_row_delete *) { return 0; }
};


/*
  Registration functions.

  ToDo: Make a way to de-register.

  ToDo: Find a good way for a plugin (eg. PBXT) to publish a generator
  registration method.
*/

int rpl_event_transaction_register(const rpl_event_consumer_transaction *cbs);
int rpl_event_statement_register(const rpl_event_consumer_statement *cbs);
int rpl_event_row_register(const rpl_event_consumer_row *cbs);
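
To tie the pieces together, here is a hypothetical end-to-end sketch built on
the draft API above. The consumer class, its object and the init function are
made-up names for illustration only:

/*
  Hypothetical consumer: records committed global transaction ids and query
  strings. Registering with the statement generator also delivers the
  transaction events, since that generator is stacked on the transaction one.
*/
class example_consumer : public rpl_event_consumer_statement
{
public:
  virtual int trx_commit(const rpl_event_transaction_commit *ev)
  {
    const rpl_event_transaction_commit::global_transaction_id *gtid=
      ev->get_global_transaction_id();
    /* ... record (gtid->cluster_id, gtid->counter) somewhere ... */
    return 0;                                   /* Success. */
  }

  virtual int stmt_query(const rpl_event_statement_query *ev)
  {
    LEX_STRING query= ev->get_query_string();
    /* ... append query.str / query.length to some log ... */
    return 0;
  }
};

static example_consumer consumer;

/* Hypothetical plugin init hook: register the consumer once at startup. */
int example_consumer_init()
{
  return rpl_event_statement_register(&consumer);
}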


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)