maria-developers team mailing list archive

Thread
Date
Re: e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync

To: Andrei Elkin <andrei.elkin@xxxxxxxxxxx>
From: Sergei Golubchik <serg@xxxxxxxxxxx>
Date: Wed, 26 May 2021 21:41:16 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, alice.sherepa@xxxxxxxxxxx
In-reply-to: <87mttjhir0.fsf@quad>
Hi, Andrei!

On Apr 27, Andrei Elkin wrote:
> >> +--let $case = B: one engine has committed its transaction branch
> >> +# Hold off after one engine has committed.
> >> +--let $shutdown_timeout=0
> >> +--let $debug_sync_action = "commit_after_run_commit_ordered SIGNAL con1_ready WAIT_FOR signal_no_signal"
> >> +# Both debug_sync and debug-dbug are required to make sure Engines remember the commit state
> >> +# debug_sync alone will not help.
> >> +--let $restart_parameters = --rpl-semi-sync-slave-enabled=1 --debug-dbug=d,binlog_truncate_partial_commit
> >
> > in the first review I wrote
> >
> >   this seems to be a rather crude way of faking a partially committed
> >   transaction. better to crash after the first engine has committed,
> >   that'd be much more natural.
> >
> > and you replied
> >
> >   This simulation aimed at (allows for) more complicated recovery time
> >   event sequences.
> >   In this case, indeed, crashing by demand is about of the same efforts.
> >   I can convert to that.
> >
> >   [x]
> 
> I had a rather sobering followup, slipped communicating to you though, sorry.
> 
> It's actually turns out not to be easy. A sequence of execution events
> 
>   E1.prepare(trx), E2.prepare(trx), E1.commit(trx), *crash*
> 
> does not make E1 such as Innodb committed persistently in file.  At
> recovery in E1 trx may be found in prepared state, unless before the
> crash I'd do something like Binlog CheckPoint (BCP) notification request
> and wait of it.

I don't understand, what a binlog checkpoint has to do with it?

> In this just a selected engine would be requested, unlike in BCP event
> case that is generated when both (in this case) have persistently committed.
> (That is we even can't involve BCP event to wait for, and the involvement
> would require some simulation as well).
> 
> So I propose to keep the current method leaving the "honest" crash to
> RQG that Alice has been training the patch with.

Alice was loading the server with transactions and doing `kill -9` at
some point. Many interesting cases that we discuss (for example, the one
above) happen in a very small time window, so I don't know whether
random killing could provide an adequate test coverage.

> >> diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> --- /dev/null
> >> +++ b/mysql-test/suite/rpl/t/rpl_semi_sync_fail_over.test
> >> @@ -0,0 +1,143 @@
> >> +
> >> +--source include/have_innodb.inc
> >> +--source include/have_debug_sync.inc
> >> +--source include/have_binlog_format_row.inc
> >> +--let $rpl_topology=1->2
> >> +--source include/rpl_init.inc
> >
> > why not to source master-slave.inc if you're using a standard master-slave
> > topology anyway?
> 
> Using numbers it's easier to manipulate with failover operatations.
> Besides it's truly uncomfortable both read and write
> when `master` would be serving the slave role as it happens in the test :-).

I believe I already replied to that earlier.
master-slave.inc creates server_1 and server_2 connections all right.

> >> diff --git a/sql/handler.h b/sql/handler.h
> >> --- a/sql/handler.h
> >> +++ b/sql/handler.h
> >> @@ -873,6 +874,15 @@ typedef struct xid_t XID;
> >>  /* The 'buf' has to have space for at least SQL_XIDSIZE bytes. */
> >>  uint get_sql_xid(XID *xid, char *buf);
> >>  
> >> +/* struct for semisync slave binlog truncate recovery */
> >> +struct xid_recovery_member
> >> +{
> >> +  my_xid xid;
> >> +  uint in_engine_prepare;  // number of engines that have xid prepared
> >> +  bool decided_to_commit;
> >> +  std::pair<uint, my_off_t> binlog_coord; // semisync recovery binlog offset
> >
> > wouldn't it be clearer to have a struct with named members?
> 
> I am not sure if there's a need for naming. But we do need
> comparison that std::pair provides for granted.
> 
> I could consider (a) new typedef(s) to represent the binlog id and
> offset in the file to make it more readable.
> 
> You say?
> 
> > in fact, I'm somewhat surprised there's no such struct for binlog coords
> > already.
> 
> I had started that back at HB implementation actually, but that struct
> did not attact much of attention.

I'd still suggest a structure, like

struct binlog_coords_t {
  uint binlog_num;
  my_off_t offset;
};

> >
> >> +};
> >> +
> >>  /* for recover() handlerton call */
> >>  #define MIN_XID_LIST_SIZE  128
> >>  #define MAX_XID_LIST_SIZE  (1024*128)
> >> @@ -4820,7 +4830,8 @@ int ha_commit_one_phase(THD *thd, bool all);
> >>  int ha_commit_trans(THD *thd, bool all);
> >>  int ha_rollback_trans(THD *thd, bool all);
> >>  int ha_prepare(THD *thd);
> >> -int ha_recover(HASH *commit_list);
> >> +int ha_recover(HASH *commit_list, MEM_ROOT *mem_root= NULL);
> >> +uint ha_recover_complete(HASH *commit_list, std::pair<uint, my_off_t> *coord= NULL);
> >
> > is coord a truncation position?
> 
> So NULL is for the normal recovery.
> Non-NULL is an estimate of the 1st to rollback.
> Trx with lesser offsets are to commit, greater of equal - to rollback.
> 
> Why it's only an estimate is 'cos the function also handles
> no-truncation cases, it then passes as `coord` the last committed
> trx's coordinate.
> The last committed can't be decided to rollback because of higher precedence of
> 
>   xid_recovery_member::decided_to_commit
> 
> which is `true` under current circumstance, in
> 
>    +static bool xarecover_decide(xid_recovery_member* member,
>    +                             xid_t x, std::pair<uint, my_off_t> *coord_ptr)
>    +{
>    +  return
>    +    member->decided_to_commit ? true :
>    +    !coord_ptr ? false :
>    +    (member->binlog_coord < *coord_ptr 
> 
> I did not reflect the subtlety of the above in function header comments.
> To be done, if you're sastified here.

yes, please, reflect it. thanks :)

> >> diff --git a/sql/log_event.h b/sql/log_event.h
> >> --- a/sql/log_event.h
> >> +++ b/sql/log_event.h
> >> @@ -482,6 +482,16 @@ class String;
> >>  */
> >>  #define LOG_EVENT_IGNORABLE_F 0x80
> >>  
> >> +/**
> >> +   @def LOG_EVENT_ACCEPT_OWN_F
> >> +
> >> +   Flag sets by the semisync slave for accepting
> >> +   the same server_id ("own") events which the slave must not have
> >> +   in its state. Typically such events were never committed by
> >> +   their originator (this server) and discared at its semisync-slave recovery.
> >> +*/
> >> +#define LOG_EVENT_ACCEPT_OWN_F 0x4000
> >
> > may be, add an assert on all received events that such a flag is not set?
> > it can only be set on events in relay log.
> 
> There's no place the flag gets logged. It'd be way too fool-proof in my
> opinion :-)!
> Or if you have a case I'll do.

too fool-proof is when you do

   a=1;
   assert(a == 1);

if you write a binlog event to disk, another thread reads it and sends
over the network to another server, which also writes it to disk first,
then another thread reads it again and executes - I don't think it's
"too" to check that the event doesn't have this flag.

Also, I personally value a lot a documentation role of asserts. Assert,
like

    DBUG_ASSERT(ev->flags & LOG_EVENT_ACCEPT_OWN_F == 0);

is basically saying

   /* LOG_EVENT_ACCEPT_OWN_F must be not set here */

but it's a self-fulfilling comment, it documents the code and at the
same time guarantees that the documentation is up-to-date.

> >> @@ -3357,6 +3367,12 @@ class Gtid_log_event: public Log_event
> >>    uint64 commit_id;
> >>    uint32 domain_id;
> >>    uchar flags2;
> >> +  uint  flags_extra; // more flags area placed after the regular flags2's one
> >> +  /*
> >> +    Extra to a "base" engine recoverable engines participating
> >> +    in the transaction. Zero, when the base engine only is present.
> >
> > what's a "base engine"?
> 
> The 1st ha_info one that is not binlog.

why not to say "number of participating engines minus one" ?
this won't puzzle anyone with a new concept of a "base engine"

> >
> >> +  */
> >> +  uint8 extra_engines;
> >>  
> >>    /* Flags2. */
> >>  
> >> diff --git a/sql/handler.cc b/sql/handler.cc
> >> --- a/sql/handler.cc
> >> +++ b/sql/handler.cc
> >> @@ -1637,9 +1672,17 @@ commit_one_phase_2(THD *thd, bool all, THD_TRANS *trans, bool is_real_trans)
> >>      DEBUG_SYNC(thd, "commit_one_phase_2");
> >>    if (ha_info)
> >>    {
> >> +    int err;
> >> +
> 
> >> +    if (has_binlog_hton(ha_info) &&
> >
> > can you replace has_binlog_hton() with, like, if trx cache is not empty or
> > binlog enabled or something like that?
> 
> I see you're driving at removing binlog ha_info from the list, are not
> you?

I'm not sure :)
Yes, I'd like to get rid of binlog_hton, as we discussed. But here,
perhaps, I just didn't like that you scan the list for something that
most likely could be checked much easier.

> I already made some experiments to have observed various complications
> in the way of implementing an equivalent of
>   has_binlog_hton().
> The closest would be
>   non-empty cache.
> 
> But `binlog_commit` may need invoking even in this case, as its branch
> shows
> 
>   if (cache_mngr->trx_cache.empty())
>   {
>     /*
>       we're here because cache_log was flushed in MYSQL_BIN_LOG::log_xid()
>     */
>     cache_mngr->reset(false, true);
> 
> 
> Overall, I'd stay with the current approach to defer its refinement to
> time binlog_hton stops registering for transaction.
> 
> Your comment?

Okay. I just don't like scanning the list, aesthetically, when one can
check a value in some fixed location or a variable. But it's not a
performance concern, the list has usually only two elements and hardly
ever can have more than three. So okay, let's stay with the current
approach.

> > @@ -1962,8 +2008,177 @@ struct xarecover_st
> 
> >> sql_print_warning("%s transaction with xid %llu",
> >
> > may be not a sql_print_warning, but a sql_print_information?
> > it's just an informational message
> 
> [->]

I presume, it means [x]

> >> +                        member->decided_to_commit ?
> >> +                        "Committed" : "Rolled back", (ulonglong) member->xid);
> >> +  }
> >> +
> >> +  return false;
> >> +}

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx
References

Re: e92037989f7: MDEV-21117: refine the server binlog-based recovery for semisync
From: Sergei Golubchik, 2021-04-25