
maria-developers team mailing list archive

Updated (by Knielsen): Replication API for stacked event generators (120)

 

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Replication API for stacked event generators
CREATION DATE..: Mon, 07 Jun 2010, 13:13
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: 
COPIES TO......: 
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 120 (http://askmonty.org/worklog/?tid=120)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 2
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Knielsen - Thu, 24 Jun 2010, 14:28)=-=-
High-Level Specification modified.

-=-=(Knielsen - Thu, 24 Jun 2010, 12:04)=-=-
Dependency created: 107 now depends on 120

-=-=(Knielsen - Thu, 24 Jun 2010, 11:59)=-=-
High Level Description modified.
--- /tmp/wklog.120.old.516      2010-06-24 11:59:24.000000000 +0000
+++ /tmp/wklog.120.new.516      2010-06-24 11:59:24.000000000 +0000
@@ -11,4 +11,4 @@
 
 Event generators can be stacked, and a generator may defer event generation to
 the next one down the stack. For example, the row-level replication event
-generator may defer DLL to the statement-level replication event generator.
+generator may defer DDL to the statement-level replication event generator.

-=-=(Knielsen - Mon, 21 Jun 2010, 08:35)=-=-
Research and design thoughts.

Worked 2 hours and estimate 0 hours remain (original estimate increased by 2 hours).



DESCRIPTION:

A part of the replication project, MWL#107.

Events are produced by event generators. Examples are

 - Generation of statement-based replication events
 - Generation of row-based events
 - Generation of PBXT engine-level replication events

and maybe reading events from the relay log on a slave could also count as
generating events.

Event generators can be stacked, and a generator may defer event generation to
the next one down the stack. For example, the row-level replication event
generator may defer DDL to the statement-level replication event generator.


HIGH-LEVEL SPECIFICATION:



Generators and consumers
------------------------

We have two concepts:

1. Event _generators_, that produce events describing all changes to data in a
   server.

2. Event consumers, that receive such events and use them in various ways.

An example of an event generator is the execution of SQL statements, which
generates events like those used for statement-based replication. Another
example is PBXT engine-level replication.

An example of an event consumer is the writing of the binlog on a master.

Event generators are not really plugins. Rather, there are specific points in
the server where events are generated. However, a generator can be part of a
plugin, for example a PBXT engine-level replication event generator would be
part of the PBXT storage engine plugin.

Event consumers, on the other hand, could be plugins.

One generator can be stacked on top of another. This means that the generator
on top (for example row-based events) will handle some events itself
(eg. a non-deterministic update in mixed-mode binlogging). Other events that it
does not want to or cannot handle (for example a deterministic delete, or DDL)
will be deferred to the generator below (for example statement-based events).


Materialisation (or not)
------------------------

A central decision is how to represent events that are generated in the API at
the point of generation.

I want to avoid making the API require that events are materialised. By
"Materialised" I mean that all (or most) of the data for the event is written
into memory in a struct/class used inside the server or serialised in a data
buffer (byte buffer) in a format suitable for network transport or disk
storage.

Using a non-materialised event means storing just a reference to the
appropriate context, from which all information for the event can be retrieved
using accessors. Typically this would be based on getting the event
information from the THD pointer.

Some reasons to avoid using materialised events in the API:

 - Replication events have a _lot_ of detailed context information that can be
   needed in events: user-defined variables, random seed, character sets,
   table column names and types, etc. etc. If we make the API based on
   materialisation, then the initial decision about which context information
   to include with which events will have to be done in the API, while ideally
   we want this decision to be made by the individual consumer plugin. There
   will thus be a conflict between what to include (to allow consumers access)
   and what to exclude (to avoid excessive needless work).

 - Materialising means defining a very specific format, which will tend to
   make the API less generic and flexible.

 - Unless the materialised format is made _very_ specific (and thus very
   inflexible), it is unlikely to be directly useful for transport
   (eg. binlog), so it will need to be re-materialised into a different format
   anyway, wasting work.

 - If a generator on top handles an event, then we want to avoid wasting work
   materialising an event in a generator below which would be completely
   unused. Thus there would be a need for the upper generator to somehow
   notify the lower generator ahead of event generation time to not fire an
   event, complicating the API.

Some advantages for materialisation:

 - Using an API based on passing around some well-defined struct event (or
   byte buffer) would be simpler than the complex class hierarchy proposed
   here, which avoids requiring materialisation.

 - Defining a materialised format would allow an easy way to use the same
   consumer code on a generator that produces events at the source of
   execution and on a generator that produces events from eg. reading them
   from an event log.

Note that there can be some middle way, where some data is materialised and
some is kept as a reference to context (eg. THD) only. This however loses most
of the mentioned advantages of materialisation.

The design proposed here aims for as little materialisation as possible.


Default materialisation format
------------------------------


While the proposed API doesn't _require_ materialisation, we can still think
about providing the _option_ for built-in materialisation. This could be
useful if such materialisation is made suitable for transport to a different
server (eg. no endian-dependence etc). If there is a facility for such
materialisation built-in to the API, it becomes possible to write something
like a generic binlog plugin or generic network transport plugin. This would
be really useful for eg. PBXT engine-level replication, as it could be
implemented without having to re-invent a binlog format.

In the proposed API, I added a simple facility to materialise every event as a
string of bytes. To use this, I still need to add a suitable facility to
de-materialise the event.

However, it is still an open question whether such a facility will be at all
useful. It still has some of the problems with materialisation mentioned
above. And I think it is likely that a good binlog implementation will need
to do more than just blindly copy opaque events from one endpoint to
another. For example, it might need different event boundaries (merge and/or
split events); it might need to augment or modify events, or inject new
events, etc.

So I think maybe it is better to add such a generic materialisation facility
on top of the basic event generator API. Such a facility would provide
materialisation of a replication event stream, not of individual events, so
it would be more flexible in providing a good implementation. It would be
implemented for all generators. It would be separate from both the event
generator API (so we have the flexibility to put a filter class in-between
generator and materialisation), and could also be separate from the actual
transport handling, stuff like fsync() of binlog files and socket connections
etc. It would be paired with a corresponding applier API which would handle
executing events on a slave.

Then we can have a default materialised event format, which is available, but
not mandatory. So there can still be other formats alongside (like the legacy
MySQL 5.1 binlog event format, and maybe Tungsten with its own format).


Encapsulation
-------------

Another fundamental question about the design is the level of encapsulation
used for the API.

At the implementation level, a lot of the work is basically to pull out all of
the needed information from the THD object/context. The API I propose tries to
_not_ expose the THD to consumers. Instead it provides accessor functions for
all the bits and pieces relevant to each replication event, while the event
class itself likely will be more or less just an encapsulated THD.

So an alternative would be to have a generic event that was just (type, THD).
Then consumers could just pull out whatever information they want from the
THD. The THD implementation is already exposed to storage engines. This would
of course greatly reduce the size of the API, eliminating lots of class
definitions and accessor functions. Though arguably it wouldn't really
simplify the API, as the complexity would just be in understanding the THD
class.

Note that we do not have to take any performance hit from using encapsulated
accessors, since compilers can inline them (though if inlining, we do not
get any ABI stability with respect to the THD implementation).

For now, the API is proposed without exposing the THD class. (Similar
encapsulation could be added in actual implementation to also not expose TABLE
and similar classes).


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)