← Back to team overview

maria-developers team mailing list archive

Re: e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

 

Hi, Andrei!

On May 27, Andrei Elkin wrote:
>>> 
>>> I had a rather sobering followup, slipped communicating to you
>>> though, sorry.
>>> 
>>> It's actually turns out not to be easy. A sequence of execution events
>>> 
>>>   E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
>>> 
>>> does not make E1 such as Innodb committed persistently in file.  At
>>> recovery in E1 trx may be found in prepared state, unless before the
>>> crash I'd do something like Binlog CheckPoint (BCP) notification request
>>> and wait of it.
>>
>> I don't understand, what a binlog checkpoint has to do with it?
> 
> So E1=Innodb.
> In the above case in order to create partially committed trx
>   Innodb.commit(trx)
> must be followed with flushing trx to disk, which is typically
> asynchronous.
> 
> I mentioned BCP 'cos it ensures a trx (actually the last in a binlog
> file) is flushed.

With --innodb-flush-log-at-trx-commit=1 it should be perfectly
synnchronous, should it not?

> >> In this just a selected engine would be requested, unlike in BCP
> >> event case that is generated when both (in this case) have
> >> persistently committed. (That is we even can't involve BCP event to
> >> wait for, and the involvement would require some simulation as
> >> well).
> >> 
> >> So I propose to keep the current method leaving the "honest" crash
> >> to RQG that Alice has been training the patch with.
> >
> > Alice was loading the server with transactions and doing `kill -9`
> > at some point. Many interesting cases that we discuss (for example,
> > the one above) happen in a very small time window, so I don't know
> > whether random killing could provide an adequate test coverage.
> 
> No doubt it's technically possible to wait for the flushing and crash,
> but still does not look easy.
> Consider if we defer this sort of mtr test to MDEV-18959?

But we've spent numerous hours and emails discussing and you've spend
countless days and weeks implementing various tricky corner cases are
are and won't be ever tested.

Here's an idea of how to test it.
We're only interested in crashes during commit. So
* when starting a commit - atomically increment some in_commit counter
* on leaving the commit - decrement it
* on signal (some unused one) - crash the server if in_commit > 0
* after the crash - get the stack traces from the core to see where in the commit did it crash.

I've attached a small patch to illustrate this idea.

Hopefully, it'll cover everything, but if not, then after analyzing
stack traces we'll see where it did not crash, and move atomic
increment/decrements to be around this area only.

> >> >> diff --git a/sql/handler.h b/sql/handler.h
> >> >> --- a/sql/handler.h
> >> >> +++ b/sql/handler.h
> >> >> @@ -873,6 +874,15 @@ typedef struct xid_t XID;
> >> >>  /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
> >> >>  uint get_sql_xid(XID *xid, char *buf);
> >> >>  
> >> >> +/* struct for semisync slave binlog truncate recovery */
> >> >> +struct xid_recovery_member
> >> >> +{
> >> >> +  my_xid xid;
> >> >> +  uint in_engine_prepare;  // number of engines that have xid prepared
> >> >> +  bool decided_to_commit;
> >> >> +  std::pair<uint, my_off_t> binlog_coord; // semisync recovery binlog offset
> >> >
> >> > wouldn't it be clearer to have a struct with named members?
> >> 
> >> I am not sure if there's a need for naming. But we do need
> >> comparison that std::pair provides for granted.
> >> 
> >> I could consider (a) new typedef(s) to represent the binlog id and
> >> offset in the file to make it more readable.
> >> 
> >> You say?
> >> 
> >> > in fact, I'm somewhat surprised there's no such struct for binlog coords
> >> > already.
> >> 
> >> I had started that back at HB implementation actually, but that struct
> >> did not attact much of attention.
> >
> 
> > I'd still suggest a structure, like
> >
> > struct binlog_coords_t {
> >   uint binlog_num;
> >   my_off_t offset;
> > };
> 
> Well, with this, like I said about std:pair, we'd also have to have
>   bool operator < (const binlog_coords_t)

like

  bool operator <(const binlog_coords_t c)
  { return binlog_num < c.binlog_num || binlog_num == c.binlog_num && offset < c.offset; }

> I seriously rate refuse to reuse as a deadly sin of programming.
> (Unlike normal sins it may be soften by realization though :-)).
> 
> Let's waive it away, right :-)?
> 
> Hoping on that `72a59e4b068` already defined the following:
...
> +/*
> +  Compound binlog-id and byte offset of transaction's first event
> +  in a sequence (e.g the recovery sequence) of binlog files.
> +  Binlog_offset(0,0) is the minimum value to mean
> +  the first byte of the first binlog file.
> +*/
> +typedef std::pair<Binlog_file_id, my_off_t> Binlog_offset;

This is better than writing std::pair<uint, my_off_t> everywhere.
Not as clear as a binlog_coords_t from above, but _perhaps_
it'll do, let's see how it'll be used.

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx
diff --git a/sql/handler.cc b/sql/handler.cc
--- a/sql/handler.cc
+++ b/sql/handler.cc
@@ -47,6 +47,8 @@
 #include "rowid_filter.h"
 #include "mysys_err.h"
 
+#include <atomic>
+extern std::atomic<uint> commits_in_progress;
 #ifdef WITH_PARTITION_STORAGE_ENGINE
 #include "ha_partition.h"
 #endif
@@ -1742,6 +1744,7 @@ int ha_commit_trans(THD *thd, bool all)
 
   need_prepare_ordered= FALSE;
   need_commit_ordered= FALSE;
+  commits_in_progress++;
 
   for (Ha_trx_info *hi= ha_info; hi; hi= hi->next())
   {
@@ -1876,6 +1879,7 @@ int ha_commit_trans(THD *thd, bool all)
                 thd->rgi_slave->is_parallel_exec);
   }
 end:
+  commits_in_progress--;
   if (mdl_backup.ticket)
   {
     /*
diff --git a/sql/mysqld.cc b/sql/mysqld.cc
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -162,6 +162,8 @@ extern "C" {					// Because of SCO 3.2V4.2
 #endif /* __WIN__ */
 
 #include <my_libwrap.h>
+#include <atomic>
+std::atomic<uint> commits_in_progress(0);
 
 #ifdef __WIN__ 
 #include <crtdbg.h>
@@ -3167,6 +3169,9 @@ pthread_handler_t signal_hand(void *arg __attribute__((unused)))
       process_alarm(sig);			// Trigger alarms.
       break;
 #endif
+    case SIGUSR2:
+      if (commits_in_progress > 0)
+        __builtin_trap();
     default:
 #ifdef EXTRA_DEBUG
       sql_print_warning("Got signal: %d  error: %d",sig,error); /* purecov: tested */

References