← Back to team overview

maria-developers team mailing list archive

Updated (by Alexi): Store in binlog text of statements that caused RBR events (47)

 

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Store in binlog text of statements that caused RBR events
CREATION DATE..: Sat, 15 Aug 2009, 23:48
SUPERVISOR.....: Monty
IMPLEMENTOR....: 
COPIES TO......: Knielsen
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 47 (http://askmonty.org/worklog/?tid=47)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 20
ESTIMATE.......: 35 (hours remain)
ORIG. ESTIMATE.: 35

PROGRESS NOTES:

-=-=(Alexi - Sun, 20 Dec 2009, 09:29)=-=-
Low Level Design modified.
--- /tmp/wklog.47.old.32726     2009-12-20 07:29:56.000000000 +0000
+++ /tmp/wklog.47.new.32726     2009-12-20 07:29:56.000000000 +0000
@@ -56,7 +56,7 @@
     0000006A | 00          |   0 - RESERVED [27]
     .....                      ...
     00000081 | 00          |   0 - RESERVED [50]
-    00000082 | 00          |   0 - ANNOTATE_RBR_EVENT [51]
+    00000082 | 00          |   0 - ANNOTATE_ROWS_EVENT [51]
     ************************
 
 2. Outline of Annotate_rows event behavior

-=-=(Alexi - Sat, 19 Dec 2009, 16:10)=-=-
Low Level Design modified.
--- /tmp/wklog.47.old.16051     2009-12-19 16:10:48.000000000 +0200
+++ /tmp/wklog.47.new.16051     2009-12-19 16:10:48.000000000 +0200
@@ -253,7 +253,7 @@
 thd query to that of the described by the event, i.e. to the query which
 caused the subsequent Rows events (see "How Master writes Annotate_rows
 events to the binary log" to follow what happens further when the subsequent
-Rows events is applied):
+Rows events are applied):
 
     int Annotate_rows_log_event::do_apply_event(...)
     {

-=-=(Alexi - Sat, 19 Dec 2009, 16:02)=-=-
High-Level Specification modified.
--- /tmp/wklog.47.old.15695     2009-12-19 16:02:33.000000000 +0200
+++ /tmp/wklog.47.new.15695     2009-12-19 16:02:33.000000000 +0200
@@ -12,7 +12,7 @@
 post-header and contains the query text in its data part. Example:
 
     ************************
-    ANNOTATE_RBR_EVENT
+    ANNOTATE_ROWS_EVENT
     ************************
     00000220 | B6 A0 2C 4B | time_when  = 1261215926
     00000224 | 33          | event_type = 51

-=-=(Alexi - Sat, 19 Dec 2009, 15:58)=-=-
Low Level Design modified.
--- /tmp/wklog.47.old.15437     2009-12-19 13:58:12.000000000 +0000
+++ /tmp/wklog.47.new.15437     2009-12-19 13:58:12.000000000 +0000
@@ -1 +1,337 @@
+Content
+~~~~~~~
+    1. Annotate_rows event number
+    2. Outline of Annotate_rows event behavior
+    3. How Master writes Annotate_rows events to the binary log
+    4. How slave treats replicate-annotate-rows-events option
+    5. How slave IO thread requests Annotate_rows events
+    6. How master executes the request
+    7. How slave SQL thread processes Annotate_rows events
+    8. General remarks
+
+1. Annotate_rows event number
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
+between the last MySQL event number and the Annotate_rows event number:
+
+    enum Log_event_type
+    { ...
+      INCIDENT_EVENT= 26,
+      // New MySQL event numbers are to be added here
+      MYSQL_EVENTS_END,
+  
+      MARIA_EVENTS_BEGIN= 51,
+      // New Maria event numbers start from here
+      ANNOTATE_ROWS_EVENT= 51,
+    
+      ENUM_END_EVENT
+    };
+
+together with the corresponding extension of 'post_header_len' array in the
+Format description event. (This extension does not affect the compatibility
+of the binary log). Here is how Format description event looks like with
+this extension:
+
+    ************************
+    FORMAT_DESCRIPTION_EVENT
+    ************************
+    00000004 | A1 A0 2C 4B | time_when  = 1261215905
+    00000008 | 0F          | event_type = 15
+    00000009 | 64 00 00 00 | server_id  = 100
+    0000000D | 7F 00 00 00 | event_len  = 127
+    00000011 | 83 00 00 00 | log_pos    = 00000083
+    00000015 | 01 00       | flags      = LOG_EVENT_BINLOG_IN_USE_F
+    ------------------------
+    00000017 | 04 00       | binlog_ver = 4
+    00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
+    .....                    ...
+    0000004B | A1 A0 2C 4B | time_created = 1261215905
+    0000004F | 13          | common_header_len = 19
+    ------------------------
+    post_header_len
+    ------------------------
+    00000050 | 38          |  56 - START_EVENT_V3 [1]
+    .....                     ...
+    00000069 | 02          |   2 - INCIDENT_EVENT [26]
+    0000006A | 00          |   0 - RESERVED [27]
+    .....                      ...
+    00000081 | 00          |   0 - RESERVED [50]
+    00000082 | 00          |   0 - ANNOTATE_RBR_EVENT [51]
+    ************************
+
+2. Outline of Annotate_rows event behavior
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Each Annotate_rows_log_event object has two private members describing the
+corresponding query:
+
+    char *m_query_txt;
+    uint  m_query_len;
+
+When the object is created for writing to a binary log, this query is taken
+from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
+as well as other implementation details):
+
+    Annotate_rows_log_event(THD *thd)
+    {
+      m_query_txt = thd->query();
+      m_query_len = thd->query_length();
+    }
+
+When the object is read from a binary log, the query is taken from the buffer
+containing the binary log representation of the event (this buffer is allocated
+in Log_event object from which all Log events are derived):
+
+    Annotate_rows_log_event(char *buf, uint event_len,
+                            Format_description_log_event *desc)
+    {
+      m_query_len = event_len - desc->common_header_len;
+      m_query_txt = buf + desc->common_header_len;
+    }
+
+The events are written to the binary log by the Log_event::write() member
+which calls virtual write_data_header() and write_data_body() members
+("data header" and "post header" are synonym in replication terminology).
+In our case, data header is empty and data body is just the query:
+
+    bool write_data_body(IO_CACHE *file)
+    {
+      return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
+    }
+
+Printing the event is just printing the query:
+
+    void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
+    {
+      my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
+    }
+
+3. How Master writes Annotate_rows events to the binary log
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The event is written to the binary log just before the group of Table_map
+events which precede corresponding Rows events (one query may generate
+several Table map events in the binary log, but the corresponding
+Annotate_rows event must be written only once before the first Table map
+event; hence the boolean variable 'with_annotate' below):
+
+    int write_locked_table_maps(THD *thd)
+    { ...
+      bool with_annotate= thd->variables.binlog_annotate_rows_events;
+      ...
+      for (uint i= 0; i < ... <number of tables> ...; ++i)
+      { ...
+        thd->binlog_write_table_map(table, ..., with_annotate);
+        with_annotate= 0; // write Annotate_event not more than once
+        ...
+      }
+      ...
+    }
+
+    int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
+    { ...
+      Table_map_log_event the_event(...);
+      ...  
+      if (with_annotate)
+      {
+        Annotate_rows_log_event anno(this);
+        mysql_bin_log.write(&anno);
+      }
+
+      mysql_bin_log.write(&the_event);
+      ...
+    }
+
+4. How slave treats replicate-annotate-rows-events option
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The replicate-annotate-rows-events option is treated just as the session
+value of the binlog_annotate_rows_events variable for the slave IO and
+SQL threads. This setting is done during initialization of these threads:
+
+    pthread_handler_t handle_slave_io(void *arg)
+    {
+      THD *thd= new THD;
+      ...
+      init_slave_thread(thd, SLAVE_THD_IO);
+      ...
+    }
+
+    pthread_handler_t handle_slave_sql(void *arg)
+    {
+      THD *thd= new THD;
+      ...
+      init_slave_thread(thd, SLAVE_THD_SQL);
+      ...
+    }
+
+    int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
+    { ...
+      thd->variables.binlog_annotate_rows_events=
+                  opt_replicate_annotate_rows_events;
+      ...
+    }
+
+5. How slave IO thread requests Annotate_rows events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+When requesting an event, the slave should inform the master whether
+it should send Annotate_rows events or not. To that end we add a new
+BINLOG_SEND_ANNOTATE_ROWS_EVENT flag used when requesting an event:
+
+    #define BINLOG_DUMP_NON_BLOCK             1
+    #define BINLOG_SEND_ANNOTATE_ROWS_EVENT   2
+
+    pthread_handler_t handle_slave_io(void *arg)
+    { ...
+      request_dump(mysql, ...);
+      ...
+    }
+
+    int request_dump(MYSQL* mysql, ...)
+    { ...  
+      if (opt_log_slave_updates &&
+          mi->io_thd->variables.binlog_annotate_rows_events)
+        binlog_flags|= BINLOG_SEND_ANNOTATE_ROWS_EVENT;
+      ...
+      int2store(buf + 4, binlog_flags);
+      ...
+      simple_command(mysql, COM_BINLOG_DUMP, buf, ...);
+      ...
+    }
+
+6. How master executes the request
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+    case COM_BINLOG_DUMP:
+      { ...
+        flags= uint2korr(packet + 4);
+        ...
+        mysql_binlog_send(thd, ..., flags);
+        ...
+      }
+
+    void mysql_binlog_send(THD* thd, ..., ushort flags)
+    { ...
+      Log_event::read_log_event(&log, packet, ...);
+      ...
+      if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
+          flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT ||
+          thd->server_id == 0 /* slave == mysqlbinlog */ )
+      {
+        my_net_write(net, packet->ptr(), packet->length());
+      }
+      ...
+    }
+
+7. How slave SQL thread processes Annotate_rows events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The slave processes each recieved event by "applying" it, i.e. by
+calling the Log_event::apply_event() function which in turn calls
+the virtual do_apply_event() member specific for each type of the
+event.
+
+    int exec_relay_log_event(THD* thd, Relay_log_info* rli)
+    { ...
+      Log_event *ev = next_event(rli);
+      ...
+      apply_event_and_update_pos(ev, ...);
+      
+      if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
+        delete ev;
+      ...      
+    }
+    
+    int apply_event_and_update_pos(Log_event *ev, ...)
+    { ...
+      ev->apply_event(...);
+      ...
+    }
+    
+    int Log_event::apply_event(...)
+    {
+      return do_apply_event(...);
+    }
+
+What does it mean to "apply" an Annotate_rows event? It means to set current
+thd query to that of the described by the event, i.e. to the query which
+caused the subsequent Rows events (see "How Master writes Annotate_rows
+events to the binary log" to follow what happens further when the subsequent
+Rows events is applied):
+
+    int Annotate_rows_log_event::do_apply_event(...)
+    {
+      thd->set_query(m_query_txt, m_query_len);
+    }
+
+NOTE. I am not sure, but possibly current values of thd->query and
+thd->query_length should be saved before calling set_query() and to be
+restored on the Annotate_rows_log_event object deletion.
+Is it really needed ?
+
+After calling this do_apply_event() function we may not delete the
+Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
+above) because thd->query now points to the string inside this object.
+We may keep the pointer to this object in the Relay_log_info:
+
+    class Relay_log_info
+    {
+    public:
+      ...
+      void set_annotate_event(Annotate_rows_log_event*);
+      Annotate_rows_log_event* get_annotate_event();
+      void free_annotate_event();
+      ...
+    private:
+      Annotate_rows_log_event* m_annotate_event;
+    };
+
+When the saved Annotate_rows object may be deleted? When all corresponding
+Rows events will be processed, i.e. before processing the first non-Rows
+event (note that Annotate_rows object resides in the binary log *after*
+the (possible) 'BEGIN' Query event which accompanies the rows events; note
+also that this deletion is adjusted with the case when some or all
+corresponding Rows events are filtered out by replicate filter rules):
+
+    int exec_relay_log_event(THD* thd, Relay_log_info* rli)
+    { ...
+      Log_event *ev= next_event(rli);
+      ...
+      if (rli->get_annotate_event() && !IS_RBR_EVENT_TYPE(ev->get_type_code()))
+        rli->free_annotate_event();
+
+      apply_event_and_update_pos(ev, ...);
+      
+      if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
+        rli->set_annotate_event((Annotate_rows_log_event*) ev);
+      else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
+        delete ev;
+      ...      
+    }
+
+where
+
+    #define IS_RBR_EVENT_TYPE(type) ( (type) == TABLE_MAP_EVENT   ||  \
+                                      (type) == WRITE_ROWS_EVENT  ||  \
+                                      (type) == UPDATE_ROWS_EVENT ||  \
+                                      (type) == DELETE_ROWS_EVENT )
+
+8. General remarks
+~~~~~~~~~~~~~~~~~~
+Kristian noticed that introducing new log event type should be coordinated
+somehow with MySQL/Sun:
+
+    Kristian: The numeric code for this event must be assigned carefully.
+    It should be coordinated with MySQL/Sun, otherwise we can get into a
+    situation where MySQL uses the same numeric code for one event that
+    MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
+    impossible.
+    Alex: I reserved about 20 numbers not to have possible conflicts
+    with MySQL.
+    Kristian: Still, I think it would be appropriate to send a polite email
+    to internals@xxxxxxxxxxxxxxx about this and suggesting to reserve the
+    event number.
+
+Also we should notice the introduction of the BINLOG_SEND_ANNOTATE_ROWS_EVENT
+flag taking into account that MySQL/Sun may also introduce a flag with the
+same value to be used in the request_dump-mysql_binlog_send interface.
+But this is mainly the question of merging: if a conflict concerning this
+flag occur, we may simply change the BINLOG_SEND_ANNOTATE_ROWS_EVENT value
+(this does not require additional changes in the code).
 

-=-=(Alexi - Sat, 19 Dec 2009, 15:41)=-=-
High-Level Specification modified.
--- /tmp/wklog.47.old.14545     2009-12-19 15:41:21.000000000 +0200
+++ /tmp/wklog.47.new.14545     2009-12-19 15:41:21.000000000 +0200
@@ -1,122 +1,107 @@
-First suggestion:
-
-> I think for this we would actually need a new binlog event type
-> (Comment_log_event?). Unless we want to log an empty statement Query_log_event
-> containing only a comment (a bit of a hack).
-
-New server option
-~~~~~~~~~~~~~~~~~
-  --binlog-annotate-rows-events
-  
-Setting this option makes RBR (rows-) events in the binary log to be
-preceded by Annotate rows events (see below). The corresponding
-'binlog_annotate_rows_events' system variable is dynamic and has both
-global and session values. Default global value is OFF.
-
-Note. Session values are usefull to make it possible to annotate only
-      some selected statements:
+Content
+~~~~~~~
+    1. Annotate_rows_log_event
+    2. Server option: --binlog-annotate-rows-events
+    3. Server option: --replicate-annotate-rows-events
+    4. mysqlbinlog option: --print-annotate-rows-events
+    5. mysqlbinlog output
+
+1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Describes the query which caused the corresponding rows events. Has empty
+post-header and contains the query text in its data part. Example:
+
+    ************************
+    ANNOTATE_RBR_EVENT
+    ************************
+    00000220 | B6 A0 2C 4B | time_when  = 1261215926
+    00000224 | 33          | event_type = 51
+    00000225 | 64 00 00 00 | server_id  = 100
+    00000229 | 36 00 00 00 | event_len  = 54
+    0000022D | 56 02 00 00 | log_pos    = 00000256
+    00000231 | 00 00       | flags      = <none>
+    ------------------------
+    00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
+    00000237 | 52 54 20 49 |
+    0000023B | 4E 54 4F 20 |
+    0000023F | 74 31 20 56 |
+    00000243 | 41 4C 55 45 |
+    00000247 | 53 20 28 31 |
+    0000024B | 29 2C 20 28 |
+    0000024F | 32 29 2C 20 |
+    00000253 | 28 33 29    | 
+    ************************
+
+In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
+and precedes the first of Table map events which accompany the corresponding
+rows events. (See example in the "mysqlbinlog output" section below.)
+
+2. Server option: --binlog-annotate-rows-events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Tells the master to write Annotate_rows events to the binary log.
+
+    * Variable Name: binlog_annotate_rows_events
+    * Scope: Global & Session
+    * Access Type: Dynamic
+    * Data Type: bool
+    * Default Value: OFF
 
+NOTE. Session values allows to annotate only some selected statements:
         ...
         SET SESSION binlog_annotate_rows_events=ON;
         ... statements to be annotated ...
         SET SESSION binlog_annotate_rows_events=OFF;
         ... statements not to be annotated ...
 
-New binlog event type
-~~~~~~~~~~~~~~~~~~~~~
-  Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
-
-Describes the query which caused the corresponding rows event. In binary log,
-precedes each Table_map_log_event. Contains empty post-header and the query
-text in its data part.
-
-The numeric code for this event must be assigned carefully. It should be
-coordinated with MySQL/Sun, otherwise we can get into a situation where MySQL
-uses the same numeric code for one event that MariaDB uses for
-ANNOTATE_ROWS_EVENT, which would make merging the two impossible.
-
-Example:
-
-  ...
-  ************************
-  ANNOTATE_ROWS_EVENT [51]
-  ************************
-  000000C7 | 54 1B 12 4B | time_when  = 1259477844
-  000000CB | 33          | event_type = 51
-  000000CC | 64 00 00 00 | server_id  = 100
-  000000D0 | 2C 00 00 00 | event_len  = 44
-  000000D4 | F3 00 00 00 | log_pos    = 000000F3
-  000000D8 | 00 00       | flags      = <none>
-  ------------------------
-  000000DA | 69 6E 73 65 | query = "insert into t1 values (1)"
-  000000DE | 72 74 20 69 |
-  000000E2 | 6E 74 6F 20 |
-  000000E6 | 74 31 20 76 |
-  000000EA | 61 6C 75 65 |
-  000000EE | 73 20 28 31 |
-  000000F2 | 29          | 
-  ************************
-  TABLE_MAP_EVENT [19]
-  ************************
-  000000F3 | 54 1B 12 4B | time_when  = 1259477844
-  000000F7 | 13          | event_type = 19
-  000000F8 | 64 00 00 00 | server_id  = 100
-  000000FC | 29 00 00 00 | event_len  = 41
-  00000100 | 1C 01 00 00 | log_pos    = 0000011C
-  00000104 | 00 00       | flags      = <none>
-  ------------------------
-  ...
-  ************************
-  WRITE_ROWS_EVENT [23]
-  ************************
-  0000011C | 54 1B 12 4B | time_when  = 1259477844
-  00000120 | 17          | event_type = 23
-  00000121 | 64 00 00 00 | server_id  = 100
-  00000125 | 22 00 00 00 | event_len  = 34
-  00000129 | 3E 01 00 00 | log_pos    = 0000013E
-  0000012D | 10 00       | flags      = LOG_EVENT_UPDATE_TABLE_MAP_VERSION_F
-  ------------------------
-  0000012F | 0F 00 00 00 | table_id = 15
-  ...
-
-New mysqlbinlog option
-~~~~~~~~~~~~~~~~~~~~~~
-  --print-annotate-rows-events
-
-With this option, mysqlbinlog prints the content of Annotate-rows 
-events (if the binary log does contain them). Without this option
-(i.e. by default), mysqlbinlog skips Annotate rows events.
-
-
-mysqlbinlog output
-~~~~~~~~~~~~~~~~~~
-Something like this:
+3. Server option: --replicate-annotate-rows-events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Tells the slave to reproduce Annotate_rows events recieved from the master
+in its own binary log (sensible only in pair with log-slave-updates option).
+
+    * Variable Name: replicate_annotate_rows_events
+    * Scope: Global
+    * Access Type: Read only
+    * Data Type: bool
+    * Default Value: OFF
+
+NOTE. Why do we additionally need this 'replicate' option? Why not to make
+the slave to reproduce this events when its binlog-annotate-rows-events
+global value is ON? Well, because, for example, we may want to configure
+the slave which should reproduce Annotate_rows events but has global
+binlog-annotate-rows-events = OFF meaning this to be the default value for
+the client threads (see also "How slave treats replicate-annotate-rows-events
+option" in LLD part).
+
+4. mysqlbinlog option: --print-annotate-rows-events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With this option, mysqlbinlog prints the content of Annotate_rows events (if
+the binary log does contain them). Without this option (i.e. by default),
+mysqlbinlog skips Annotate_rows events.
 
+5. mysqlbinlog output
+~~~~~~~~~~~~~~~~~~~~~
+With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
+in a form like this:
   ...
-  # at 199
-  # at 243
-  # at 284
-  #091129  9:57:24 server id 100  end_log_pos 243      Query: `insert into t1 values
-(1)`
-  #091129  9:57:24 server id 100  end_log_pos 284      Table_map: `test`.`t1` mapped
-to number 15
-  #091129  9:57:24 server id 100  end_log_pos 318      Write_rows: table id 15
+    # at 1646
+    #091219 12:45:26 server id 100  end_log_pos 1714   Query   thread_id=1
+exec_time=0    error_code=0
+    SET TIMESTAMP=1261215926/*!*/;
+    BEGIN
+    /*!*/;
+    # at 1714
+    # at 1812
+    # at 1853
+    # at 1894
+    # at 1938
+    #091219 12:45:26 server id 100  end_log_pos 1812   Query: `DELETE t1, t2 FROM
+t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
+    #091219 12:45:26 server id 100  end_log_pos 1853   Table_map: `test`.`t1`
+mapped to number 16
+    #091219 12:45:26 server id 100  end_log_pos 1894   Table_map: `test`.`t2`
+mapped to number 17
+    #091219 12:45:26 server id 100  end_log_pos 1938   Delete_rows: table id 16
+    #091219 12:45:26 server id 100  end_log_pos 1982   Delete_rows: table id 17
 flags: STMT_END_F
-
-  BINLOG '
-  VBsSSzNkAAAALAAAAPMAAAAAAGluc2VydCBpbnRvIHQxIHZhbHVlcyAoMSk=
-  VBsSSxNkAAAAKQAAABwBAAAAAA8AAAAAAAAABHRlc3QAAnQxAAEDAAE=
-  VBsSSxdkAAAAIgAAAD4BAAAQAA8AAAAAAAEAAf/+AQAAAA==
-  '/*!*/;
-  ### INSERT INTO test.t1
-  ### SET
-  ###   @1=1 /* INT meta=0 nullable=1 is_null=0 */
   ...
 
-When master sends Annotate rows events
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-1. Master always sends Annotate_rows events to mysqlbinlog (in
-   remote case).
-2. Master sends Annotate_rows events to a slave only if the slave has
-   both log-slave-updates and binlog-annotate-rows-events options set.
-

-=-=(Bothorsen - Fri, 18 Dec 2009, 16:22)=-=-
Add estimation time.

Worked 5 hours and estimate 35 hours remain (original estimate increased by 5 hours).

-=-=(Bothorsen - Fri, 18 Dec 2009, 16:16)=-=-
This is the work done on this patch so far. Most of it done by Alex.

Worked 15 hours and estimate 035 hours remain (original estimate increased by 50 hours).

-=-=(Alexi - Fri, 04 Dec 2009, 13:00)=-=-
High-Level Specification modified.
--- /tmp/wklog.47.old.14001     2009-12-04 13:00:24.000000000 +0200
+++ /tmp/wklog.47.new.14001     2009-12-04 13:00:24.000000000 +0200
@@ -6,27 +6,27 @@
 
 New server option
 ~~~~~~~~~~~~~~~~~
-  --binlog-annotate-row-events
+  --binlog-annotate-rows-events
   
-Setting this option makes RBR (row-) events in the binary log to be
+Setting this option makes RBR (rows-) events in the binary log to be
 preceded by Annotate rows events (see below). The corresponding
-'binlog_annotate_row_events' system variable is dynamic and has both
+'binlog_annotate_rows_events' system variable is dynamic and has both
 global and session values. Default global value is OFF.
 
 Note. Session values are usefull to make it possible to annotate only
       some selected statements:
 
         ...
-        SET SESSION binlog_annotate_row_events=ON;
+        SET SESSION binlog_annotate_rows_events=ON;
         ... statements to be annotated ...
-        SET SESSION binlog_annotate_row_events=OFF;
+        SET SESSION binlog_annotate_rows_events=OFF;
         ... statements not to be annotated ...
 
 New binlog event type
 ~~~~~~~~~~~~~~~~~~~~~
   Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
 
-Describes the query which caused the corresponding row event. In binary log,
+Describes the query which caused the corresponding rows event. In binary log,
 precedes each Table_map_log_event. Contains empty post-header and the query
 text in its data part.
 
@@ -79,6 +79,15 @@
   0000012F | 0F 00 00 00 | table_id = 15
   ...
 
+New mysqlbinlog option
+~~~~~~~~~~~~~~~~~~~~~~
+  --print-annotate-rows-events
+
+With this option, mysqlbinlog prints the content of Annotate-rows 
+events (if the binary log does contain them). Without this option
+(i.e. by default), mysqlbinlog skips Annotate rows events.
+
+
 mysqlbinlog output
 ~~~~~~~~~~~~~~~~~~
 Something like this:
@@ -109,5 +118,5 @@
 1. Master always sends Annotate_rows events to mysqlbinlog (in
    remote case).
 2. Master sends Annotate_rows events to a slave only if the slave has
-   both log-slave-updates and binlog-annotate-row-events options set.
+   both log-slave-updates and binlog-annotate-rows-events options set.
 

-=-=(Alexi - Wed, 02 Dec 2009, 13:32)=-=-
Low Level Design modified.
--- /tmp/wklog.47.old.17456     2009-12-02 13:32:18.000000000 +0200
+++ /tmp/wklog.47.new.17456     2009-12-02 13:32:18.000000000 +0200
@@ -1,8 +1 @@
-mysql_binlog_send() [sql/sql_repl.cc]
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-1. When sending events to a slave, master should simply skip
-   Annotate_rows events (they are not needed for replication).
-   [ ??? Multi-master - currently not clear ]
-2. When sending events to mysqlbinlog (remote case), master
-   must send Annotate_rows events as well.
 

-=-=(Alexi - Wed, 02 Dec 2009, 13:31)=-=-
High-Level Specification modified.
--- /tmp/wklog.47.old.17414     2009-12-02 11:31:59.000000000 +0000
+++ /tmp/wklog.47.new.17414     2009-12-02 11:31:59.000000000 +0000
@@ -104,3 +104,10 @@
   ###   @1=1 /* INT meta=0 nullable=1 is_null=0 */
   ...
 
+When master sends Annotate rows events
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. Master always sends Annotate_rows events to mysqlbinlog (in
+   remote case).
+2. Master sends Annotate_rows events to a slave only if the slave has
+   both log-slave-updates and binlog-annotate-row-events options set.
+

	------------------------------------------------------------

		-=-=(View All Progress Notes, 17 total)=-=-
	http://askmonty.org/worklog/index.pl?tid=47&nolimit=1


DESCRIPTION:

Store in binlog (and show in mysqlbinlog output) texts of statements that 
caused RBR events

This is needed for (list from Monty):
- Easier to understand why updates happened
- Would make it easier to find out where in application things went
  wrong (as you can search for exact strings)
- Allow one to filter things based on comments in the statement.

The cost of this can be that the binlog will be approximately 2x in size 
(especially insert of big blob's would be a bit painful), so this should 
be an optional feature.


HIGH-LEVEL SPECIFICATION:



Content
~~~~~~~
    1. Annotate_rows_log_event
    2. Server option: --binlog-annotate-rows-events
    3. Server option: --replicate-annotate-rows-events
    4. mysqlbinlog option: --print-annotate-rows-events
    5. mysqlbinlog output

1. Annotate_rows_log_event [ ANNOTATE_ROWS_EVENT ]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Describes the query which caused the corresponding rows events. Has empty
post-header and contains the query text in its data part. Example:

    ************************
    ANNOTATE_ROWS_EVENT
    ************************
    00000220 | B6 A0 2C 4B | time_when  = 1261215926
    00000224 | 33          | event_type = 51
    00000225 | 64 00 00 00 | server_id  = 100
    00000229 | 36 00 00 00 | event_len  = 54
    0000022D | 56 02 00 00 | log_pos    = 00000256
    00000231 | 00 00       | flags      = <none>
    ------------------------
    00000233 | 49 4E 53 45 | query = "INSERT INTO t1 VALUES (1), (2), (3)"
    00000237 | 52 54 20 49 |
    0000023B | 4E 54 4F 20 |
    0000023F | 74 31 20 56 |
    00000243 | 41 4C 55 45 |
    00000247 | 53 20 28 31 |
    0000024B | 29 2C 20 28 |
    0000024F | 32 29 2C 20 |
    00000253 | 28 33 29    | 
    ************************

In binary log, Annotate_rows event follows the (possible) 'BEGIN' Query event
and precedes the first of Table map events which accompany the corresponding
rows events. (See example in the "mysqlbinlog output" section below.)

2. Server option: --binlog-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the master to write Annotate_rows events to the binary log.

    * Variable Name: binlog_annotate_rows_events
    * Scope: Global & Session
    * Access Type: Dynamic
    * Data Type: bool
    * Default Value: OFF

NOTE. Session values allows to annotate only some selected statements:
    ...
    SET SESSION binlog_annotate_rows_events=ON;
    ... statements to be annotated ...
    SET SESSION binlog_annotate_rows_events=OFF;
    ... statements not to be annotated ...

3. Server option: --replicate-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tells the slave to reproduce Annotate_rows events recieved from the master
in its own binary log (sensible only in pair with log-slave-updates option).

    * Variable Name: replicate_annotate_rows_events
    * Scope: Global
    * Access Type: Read only
    * Data Type: bool
    * Default Value: OFF

NOTE. Why do we additionally need this 'replicate' option? Why not to make
the slave to reproduce this events when its binlog-annotate-rows-events
global value is ON? Well, because, for example, we may want to configure
the slave which should reproduce Annotate_rows events but has global
binlog-annotate-rows-events = OFF meaning this to be the default value for
the client threads (see also "How slave treats replicate-annotate-rows-events
option" in LLD part).

4. mysqlbinlog option: --print-annotate-rows-events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With this option, mysqlbinlog prints the content of Annotate_rows events (if
the binary log does contain them). Without this option (i.e. by default),
mysqlbinlog skips Annotate_rows events.

5. mysqlbinlog output
~~~~~~~~~~~~~~~~~~~~~
With --print-annotate-rows-events, mysqlbinlog outputs Annotate_rows events
in a form like this:
    ...
    # at 1646
    #091219 12:45:26 server id 100  end_log_pos 1714    Query   thread_id=1
exec_time=0     error_code=0
    SET TIMESTAMP=1261215926/*!*/;
    BEGIN
    /*!*/;
    # at 1714
    # at 1812
    # at 1853
    # at 1894
    # at 1938
    #091219 12:45:26 server id 100  end_log_pos 1812    Query: `DELETE t1, t2 FROM
t1 INNER JOIN t2 INNER JOIN t3 WHERE t1.a=t2.a AND t2.a=t3.a`
    #091219 12:45:26 server id 100  end_log_pos 1853    Table_map: `test`.`t1`
mapped to number 16
    #091219 12:45:26 server id 100  end_log_pos 1894    Table_map: `test`.`t2`
mapped to number 17
    #091219 12:45:26 server id 100  end_log_pos 1938    Delete_rows: table id 16
    #091219 12:45:26 server id 100  end_log_pos 1982    Delete_rows: table id 17
flags: STMT_END_F
    ...


LOW-LEVEL DESIGN:



Content
~~~~~~~
    1. Annotate_rows event number
    2. Outline of Annotate_rows event behavior
    3. How Master writes Annotate_rows events to the binary log
    4. How slave treats replicate-annotate-rows-events option
    5. How slave IO thread requests Annotate_rows events
    6. How master executes the request
    7. How slave SQL thread processes Annotate_rows events
    8. General remarks

1. Annotate_rows event number
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To avoid possible event numbers conflict with MySQL/Sun, we leave a gap
between the last MySQL event number and the Annotate_rows event number:

    enum Log_event_type
    { ...
      INCIDENT_EVENT= 26,
      // New MySQL event numbers are to be added here
      MYSQL_EVENTS_END,
  
      MARIA_EVENTS_BEGIN= 51,
      // New Maria event numbers start from here
      ANNOTATE_ROWS_EVENT= 51,
    
      ENUM_END_EVENT
    };

together with the corresponding extension of 'post_header_len' array in the
Format description event. (This extension does not affect the compatibility
of the binary log). Here is how Format description event looks like with
this extension:

    ************************
    FORMAT_DESCRIPTION_EVENT
    ************************
    00000004 | A1 A0 2C 4B | time_when  = 1261215905
    00000008 | 0F          | event_type = 15
    00000009 | 64 00 00 00 | server_id  = 100
    0000000D | 7F 00 00 00 | event_len  = 127
    00000011 | 83 00 00 00 | log_pos    = 00000083
    00000015 | 01 00       | flags      = LOG_EVENT_BINLOG_IN_USE_F
    ------------------------
    00000017 | 04 00       | binlog_ver = 4
    00000019 | 35 2E 32 2E | server_ver = 5.2.0-MariaDB-alpha-debug-log
    .....                    ...
    0000004B | A1 A0 2C 4B | time_created = 1261215905
    0000004F | 13          | common_header_len = 19
    ------------------------
    post_header_len
    ------------------------
    00000050 | 38          |  56 - START_EVENT_V3 [1]
    .....                     ...
    00000069 | 02          |   2 - INCIDENT_EVENT [26]
    0000006A | 00          |   0 - RESERVED [27]
    .....                      ...
    00000081 | 00          |   0 - RESERVED [50]
    00000082 | 00          |   0 - ANNOTATE_ROWS_EVENT [51]
    ************************

2. Outline of Annotate_rows event behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each Annotate_rows_log_event object has two private members describing the
corresponding query:

    char *m_query_txt;
    uint  m_query_len;

When the object is created for writing to a binary log, this query is taken
from 'thd' (for short, below we omit the 'Annotate_rows_log_event::' prefix
as well as other implementation details):

    Annotate_rows_log_event(THD *thd)
    {
      m_query_txt = thd->query();
      m_query_len = thd->query_length();
    }

When the object is read from a binary log, the query is taken from the buffer
containing the binary log representation of the event (this buffer is allocated
in Log_event object from which all Log events are derived):

    Annotate_rows_log_event(char *buf, uint event_len,
                            Format_description_log_event *desc)
    {
      m_query_len = event_len - desc->common_header_len;
      m_query_txt = buf + desc->common_header_len;
    }

The events are written to the binary log by the Log_event::write() member
which calls virtual write_data_header() and write_data_body() members
("data header" and "post header" are synonym in replication terminology).
In our case, data header is empty and data body is just the query:

    bool write_data_body(IO_CACHE *file)
    {
      return my_b_safe_write(file, (uchar*) m_query_txt, m_query_len);
    }

Printing the event is just printing the query:

    void Annotate_rows_log_event::print(FILE *file, PRINT_EVENT_INFO *pinfo)
    {
      my_b_printf(&pinfo->head_cache, "\tQuery: `%s`\n", m_query_txt);
    }

3. How Master writes Annotate_rows events to the binary log
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The event is written to the binary log just before the group of Table_map
events which precede corresponding Rows events (one query may generate
several Table map events in the binary log, but the corresponding
Annotate_rows event must be written only once before the first Table map
event; hence the boolean variable 'with_annotate' below):

    int write_locked_table_maps(THD *thd)
    { ...
      bool with_annotate= thd->variables.binlog_annotate_rows_events;
      ...
      for (uint i= 0; i < ... <number of tables> ...; ++i)
      { ...
        thd->binlog_write_table_map(table, ..., with_annotate);
        with_annotate= 0; // write Annotate_event not more than once
        ...
      }
      ...
    }

    int THD::binlog_write_table_map(TABLE *table, ..., bool with_annotate)
    { ...
      Table_map_log_event the_event(...);
      ...  
      if (with_annotate)
      {
        Annotate_rows_log_event anno(this);
        mysql_bin_log.write(&anno);
      }

      mysql_bin_log.write(&the_event);
      ...
    }

4. How slave treats replicate-annotate-rows-events option
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The replicate-annotate-rows-events option is treated just as the session
value of the binlog_annotate_rows_events variable for the slave IO and
SQL threads. This setting is done during initialization of these threads:

    pthread_handler_t handle_slave_io(void *arg)
    {
      THD *thd= new THD;
      ...
      init_slave_thread(thd, SLAVE_THD_IO);
      ...
    }

    pthread_handler_t handle_slave_sql(void *arg)
    {
      THD *thd= new THD;
      ...
      init_slave_thread(thd, SLAVE_THD_SQL);
      ...
    }

    int init_slave_thread(THD* thd, SLAVE_THD_TYPE thd_type)
    { ...
      thd->variables.binlog_annotate_rows_events=
                  opt_replicate_annotate_rows_events;
      ...
    }

5. How slave IO thread requests Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When requesting an event, the slave should inform the master whether
it should send Annotate_rows events or not. To that end we add a new
BINLOG_SEND_ANNOTATE_ROWS_EVENT flag used when requesting an event:

    #define BINLOG_DUMP_NON_BLOCK             1
    #define BINLOG_SEND_ANNOTATE_ROWS_EVENT   2

    pthread_handler_t handle_slave_io(void *arg)
    { ...
      request_dump(mysql, ...);
      ...
    }

    int request_dump(MYSQL* mysql, ...)
    { ...  
      if (opt_log_slave_updates &&
          mi->io_thd->variables.binlog_annotate_rows_events)
        binlog_flags|= BINLOG_SEND_ANNOTATE_ROWS_EVENT;
      ...
      int2store(buf + 4, binlog_flags);
      ...
      simple_command(mysql, COM_BINLOG_DUMP, buf, ...);
      ...
    }

6. How master executes the request
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    case COM_BINLOG_DUMP:
      { ...
        flags= uint2korr(packet + 4);
        ...
        mysql_binlog_send(thd, ..., flags);
        ...
      }

    void mysql_binlog_send(THD* thd, ..., ushort flags)
    { ...
      Log_event::read_log_event(&log, packet, ...);
      ...
      if ((*packet)[EVENT_TYPE_OFFSET + 1] != ANNOTATE_ROWS_EVENT ||
          flags & BINLOG_SEND_ANNOTATE_ROWS_EVENT ||
          thd->server_id == 0 /* slave == mysqlbinlog */ )
      {
        my_net_write(net, packet->ptr(), packet->length());
      }
      ...
    }

7. How slave SQL thread processes Annotate_rows events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The slave processes each recieved event by "applying" it, i.e. by
calling the Log_event::apply_event() function which in turn calls
the virtual do_apply_event() member specific for each type of the
event.

    int exec_relay_log_event(THD* thd, Relay_log_info* rli)
    { ...
      Log_event *ev = next_event(rli);
      ...
      apply_event_and_update_pos(ev, ...);
      
      if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
        delete ev;
      ...      
    }
    
    int apply_event_and_update_pos(Log_event *ev, ...)
    { ...
      ev->apply_event(...);
      ...
    }
    
    int Log_event::apply_event(...)
    {
      return do_apply_event(...);
    }

What does it mean to "apply" an Annotate_rows event? It means to set current
thd query to that of the described by the event, i.e. to the query which
caused the subsequent Rows events (see "How Master writes Annotate_rows
events to the binary log" to follow what happens further when the subsequent
Rows events are applied):

    int Annotate_rows_log_event::do_apply_event(...)
    {
      thd->set_query(m_query_txt, m_query_len);
    }

NOTE. I am not sure, but possibly current values of thd->query and
thd->query_length should be saved before calling set_query() and to be
restored on the Annotate_rows_log_event object deletion.
Is it really needed ?

After calling this do_apply_event() function we may not delete the
Annotate_rows_log_event object immediatedly (see exec_relay_log_event()
above) because thd->query now points to the string inside this object.
We may keep the pointer to this object in the Relay_log_info:

    class Relay_log_info
    {
    public:
      ...
      void set_annotate_event(Annotate_rows_log_event*);
      Annotate_rows_log_event* get_annotate_event();
      void free_annotate_event();
      ...
    private:
      Annotate_rows_log_event* m_annotate_event;
    };

When the saved Annotate_rows object may be deleted? When all corresponding
Rows events will be processed, i.e. before processing the first non-Rows
event (note that Annotate_rows object resides in the binary log *after*
the (possible) 'BEGIN' Query event which accompanies the rows events; note
also that this deletion is adjusted with the case when some or all
corresponding Rows events are filtered out by replicate filter rules):

    int exec_relay_log_event(THD* thd, Relay_log_info* rli)
    { ...
      Log_event *ev= next_event(rli);
      ...
      if (rli->get_annotate_event() && !IS_RBR_EVENT_TYPE(ev->get_type_code()))
        rli->free_annotate_event();

      apply_event_and_update_pos(ev, ...);
      
      if (ev->get_type_code() == ANNOTATE_ROWS_EVENT)
        rli->set_annotate_event((Annotate_rows_log_event*) ev);
      else if (ev->get_type_code() != FORMAT_DESCRIPTION_EVENT)
        delete ev;
      ...      
    }

where

    #define IS_RBR_EVENT_TYPE(type) ( (type) == TABLE_MAP_EVENT   ||  \
                                      (type) == WRITE_ROWS_EVENT  ||  \
                                      (type) == UPDATE_ROWS_EVENT ||  \
                                      (type) == DELETE_ROWS_EVENT )

8. General remarks
~~~~~~~~~~~~~~~~~~
Kristian noticed that introducing new log event type should be coordinated
somehow with MySQL/Sun:

    Kristian: The numeric code for this event must be assigned carefully.
    It should be coordinated with MySQL/Sun, otherwise we can get into a
    situation where MySQL uses the same numeric code for one event that
    MariaDB uses for ANNOTATE_ROWS_EVENT, which would make merging the two
    impossible.
    Alex: I reserved about 20 numbers not to have possible conflicts
    with MySQL.
    Kristian: Still, I think it would be appropriate to send a polite email
    to internals@xxxxxxxxxxxxxxx about this and suggesting to reserve the
    event number.

Also we should notice the introduction of the BINLOG_SEND_ANNOTATE_ROWS_EVENT
flag taking into account that MySQL/Sun may also introduce a flag with the
same value to be used in the request_dump-mysql_binlog_send interface.
But this is mainly the question of merging: if a conflict concerning this
flag occur, we may simply change the BINLOG_SEND_ANNOTATE_ROWS_EVENT value
(this does not require additional changes in the code).


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)