maria-developers team mailing list archive

Thread
Date
Re: Review request for MDEV-6020, MDEV-5914, MDEV-5941, and MDEV-5262

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Thu, 03 Jul 2014 14:51:46 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20140703082958.GB7882@meddwl.fritz.box> (Sergei Golubchik's message of "Thu, 3 Jul 2014 10:29:58 +0200")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
Sergei Golubchik <serg@xxxxxxxxxxx> writes:

> Here it is, below.

Thanks for review!

I'll give comments inline where applicable, other points I will just implement
as you suggested, once the overall API is decided.

> * MySQL has a variable binlog_order_commits.
>   we can introduce it too - it allows to cut lots of corners
>   in binlogging code. And most users don't need strict binlog ordering
>   most of the time.

I don't understand.

AFAIU, binlog_order_commits mainly ensures that the InnoDB redo log has the
same commit sequence as the binlog. This is mainly needed to be able to take a
non-blocking XtraBackup (without FLUSH TABLES WITH READ LOCK) and use it as a
slave. This does not seem relevant to the case at hand?

My guess is, you have in mind a different option, that allows a slave to write
its binlog in a different order than what it was on the master?

In MySQL, there is no facility to force same binlog order on a slave as on a
master, though when using single-treaded replication it will normally be the
case anyway. Instead, there are various mechanisms in their parallel
replication and in their GTID to deal with out-of-order binlogs between master
and slave.

Is this what you are suggesting? An option that allows the MariaDB slave to
commit in different order than the master?

The main problem with that is that it fundamentally breaks the design of
MariaDB GTID. Also, it makes it hard to know where in the binlog to restart
the slave after a slave crash?

There are some interesting possibilities in this direction, but I'd like to
understand better what you have in mind before starting to discuss in a random
direction.

> * I tried to ignore background killing code, because we're discussing
>   whether it'll stay or not. And the discussion of the API is happening
>   in another thread. So I was mostly looking at the general logic and at
>   the binlog re-reading code, that we've already discussed on irc.

Agree.

>> === modified file 'sql/sql_admin.cc'
>> --- sql/sql_admin.cc	2014-03-27 20:32:53 +0000
>> +++ sql/sql_admin.cc	2014-06-10 17:48:32 +0000
>> @@ -912,7 +912,7 @@ static bool mysql_admin_table(THD* thd,
>>        protocol->store(operator_name, system_charset_info);
>>        if (result_code) // either mysql_recreate_table or analyze failed
>>        {
>> -        DBUG_ASSERT(thd->is_error() || thd->killed);
>> +        DBUG_ASSERT(thd->is_error());
>
> why?

This is a revert of incorrect fix for MySQL Bug#58933.

The original bug was that some code was returning an error if thd->killed, but
it was not setting thd->is_error() with send_kill_message() or similar.

Rather than fix the real bug, the Bug#58933 patch changed the assertion.

I happened upon the real bug during my testing, and had to fix it to make my
code work. I somehow noticed the incorrect fix for Bug#58933, and decided to
change back the assertion to what it was originally, as the extra condition is
no longer necessary.

>> === modified file 'storage/heap/hp_write.c'
>> --- storage/heap/hp_write.c	2014-03-26 21:25:38 +0000
>> +++ storage/heap/hp_write.c	2014-06-10 17:48:32 +0000
>> @@ -153,10 +153,10 @@ static uchar *next_free_record_pos(HP_SH
>>          (info->data_length + info->index_length >= info->max_table_size))
>>      {
>>        DBUG_PRINT("error",
>> -                 ("record file full. records: %u  max_records: %lu  "
>> +                 ("record file full. records: %lu  max_records: %lu  "
>>                    "data_length: %llu  index_length: %llu  "
>>                    "max_table_size: %llu",
>> -                  info->records, info->max_records,
>> +                  (unsigned long)info->records, info->max_records,
>
> Eh? Why did you add a cast from ulong to ulong?

Sorry, will fix (I think I got a compiler warning, and fixed it without
looking up the type. Generally, I tend to just use %lu and (ulong) cast on
values in debug printouts, to avoid problems with eg. %llu not being available
everywhere etc. Sloppy, I suppose...)

>> === modified file 'sql/sql_class.h'
>> --- sql/sql_class.h	2014-04-25 10:58:31 +0000
>> +++ sql/sql_class.h	2014-06-10 17:48:32 +0000
>> @@ -1357,7 +1357,8 @@ enum enum_thread_type
>>    SYSTEM_THREAD_NDBCLUSTER_BINLOG= 8,
>>    SYSTEM_THREAD_EVENT_SCHEDULER= 16,
>>    SYSTEM_THREAD_EVENT_WORKER= 32,
>> -  SYSTEM_THREAD_BINLOG_BACKGROUND= 64
>> +  SYSTEM_THREAD_BINLOG_BACKGROUND= 64,
>> +  SYSTEM_THREAD_SLAVE_BACKGROUND= 128,
>
> is it your background killer thread?

Yes.

>> === modified file 'storage/innobase/lock/lock0lock.cc'
>> --- storage/innobase/lock/lock0lock.cc	2014-02-26 18:22:48 +0000
>> +++ storage/innobase/lock/lock0lock.cc	2014-06-10 17:48:32 +0000
>> @@ -1020,6 +1021,28 @@ lock_rec_has_to_wait(
>>  			return(FALSE);
>>  		}
>>  
>> +		if ((type_mode & LOCK_GAP || lock_rec_get_gap(lock2)) &&
>> +		    !thd_need_ordering_with(trx->mysql_thd,
>> +					    lock2->trx->mysql_thd)) {
>> +			/* If the upper server layer has already decided on the
>> +			commit order between the transaction requesting the
>> +			lock and the transaction owning the lock, we do not
>> +			need to wait for gap locks. Such ordeering by the upper
>> +			server layer happens in parallel replication, where the
>> +			commit order is fixed to match the original order on the
>> +			master.
>> +
>> +			Such gap locks are mainly needed to get serialisability
>> +			between transactions so that they will be binlogged in
>> +			the correct order so that statement-based replication
>> +			will give the correct results. Since the right order
>> +			was already determined on the master, we do not need
>> +			to enforce it again here (and doing so could lead to
>> +			occasional deadlocks). */
>
> Please, clarify this comment. It is not clear whether this if() is required
> for correctness, for the code to work. Or it's merely an optimization,
> and these occasional deadlocks will only cause the transactiion to be
> reexecuted, but the end result will still be the same.
>
> I believe, you meant the latter.

Yes, the latter. I will try to clarify, also give the actual examples with
INSERT vs. DELETE and INSERT vs. UPDATE.

Or do you think we should omit this part of the patch, not optimise this
particular case, and instead just rely on the general fallback of detecting a
deadlock and killing+retrying?

>> === modified file 'sql/log.cc'
>> --- sql/log.cc	2014-03-23 19:09:38 +0000
>> +++ sql/log.cc	2014-06-10 17:48:32 +0000
>> @@ -4067,7 +4068,30 @@ int MYSQL_BIN_LOG::purge_first_log(Relay
>>    DBUG_ASSERT(!strcmp(rli->linfo.log_file_name,rli->event_relay_log_name));
>>  
>>    mysql_mutex_lock(&LOCK_index);
>> -  to_purge_if_included= my_strdup(rli->group_relay_log_name, MYF(0));
>> +
>> +  ir= rli->inuse_relaylog_list;
>> +  while (ir)
>> +  {
>> +    inuse_relaylog *next= ir->next;
>
> are relay logs ordered in this list?
> That is, all newer relay logs are after the current one (ir),
> all older relay logs are before?

Yes, they should be, I think.

>
>> +    if (!ir->completed || ir->dequeued_count < ir->queued_count)
>> +    {
>> +      included= false;
>> +      break;
>> +    }
>> +    if (!included && 0 == strcmp(ir->name, rli->group_relay_log_name))
>
> do we ever use this reverse style in the code (const == expr)?
> I don't remember that

I sometimes use it to make it clearer that it's "compare strings for
equality".

It was not clear if you wanted me to change it to !strcmp(...) ?
Or if you were just asking?

>> === modified file 'sql/log_event.cc'
>> --- sql/log_event.cc	2014-04-25 10:58:31 +0000
>> +++ sql/log_event.cc	2014-06-10 17:48:32 +0000
>> @@ -4384,18 +4420,21 @@ Default database: '%s'. Query: '%s'",
>>      {
>>        DBUG_PRINT("info",("error ignored"));
>>        clear_all_errors(thd, const_cast<Relay_log_info*>(rli));
>> -      thd->reset_killed();
>> +      if (actual_error == ER_QUERY_INTERRUPTED ||
>> +          actual_error == ER_CONNECTION_KILLED)
>> +        thd->reset_killed();
>
> Why?

This code was clearing thd->killed flag unconditionally, after every executed
Query_log_event.

This causes a race. Suppose we detect a deadlock that T1 needs to wait for T2
just before the code executing T2 reaches this place. We set the kill flag for
T2, but immedialy afterwards this flag gets cleared. Thus the kill of T2 is
lost, and we end up in a deadlock.

I got this during my stress testing, very occasionally the kill would be lost
and my tests would hang. Had a bit of fun tracking it down, IIRC :-)

This code runs if either there is no error, or there was an error that was
explicitly requested to be ignored. So I changed it to only clear thd->killed
if we are actually ignoring a killed error.

(I wonder if that thd->reset_killed() is really needed at all? I suspect not,
but I don't know how to check for sure. So I prefered to do a minimal change
like this to be safe, as this is a change in a GA version).

>> @@ -7284,28 +7321,34 @@ int Xid_log_event::do_apply_event(rpl_gr
>>    bool res;
>>    int err;
>>    rpl_gtid gtid;
>> -  uint64 sub_id;
>> +  uint64 sub_id= 0;
>>    Relay_log_info const *rli= rgi->rli;
>>  
>> +  mysql_reset_thd_for_next_command(thd);
>
> hmm. mysql_reset_thd_for_next_command() is called before the new
> sql statement, not at the end of the old one.
> So, you end up calling it twice.
> It is not a function to use when you only need to reset the error status.
> The same, I suppose, applies to the previous chunk, Gtid_log_event::do_apply_event

So my problem here is a lack of knowledge of how things should be done
correctly.

I think the issue here is that I added code that updates the
mysql.gtid_slave_pos table, before doing the actual commit of the
Xid_log_event.

This table update opens tables, it can auto-commit a transaction, it can throw
errors. Thus it does much of what is involved in "a new sql statement".

Before I added this call, I got various assertions, like about ok status
already being set if I gave an error, stuff like that.

So if mysql_reset_thd_for_next_command() is not correct, what should I do
instead?


>> +static void
>> +convert_kill_to_deadlock_error(rpl_group_info *rgi)
>> +{
>> +  THD *thd= rgi->thd;
>> +  int err_code= thd->get_stmt_da()->sql_errno();
>> +
>> +  if ((err_code == ER_QUERY_INTERRUPTED || err_code == ER_CONNECTION_KILLED) &&
>> +      rgi->killed_for_retry)
>> +  {
>> +    thd->clear_error();
>> +    thd->get_stmt_da()->reset_diagnostics_area();
>
> Why do you always do both clear_error() and reset_diagnostics_area()?
> The former normally includes the latter.

Just lack of knowledge of the right way to do things, I think.
I will replace it with just thd->clear_error().

>> +  do
>> +  {
>> +    Log_event_type event_type;
>> +    Log_event *ev;
>> +    rpl_parallel_thread::queued_event *qev;
>> +
>> +    /* The loop is here so we can try again the next relay log file on EOF. */
>> +    for (;;)
>> +    {
>> +      old_offset= cur_offset;
>> +      ev= Log_event::read_log_event(&rlog, 0,
>> +                                    rli->relay_log.description_event_for_exec /* ToDo: this needs fixing */,
>> +                                    opt_slave_sql_verify_checksum);
>
> I don't particularly like that there's a second binlog-reading loop here.
> It would be better to abstract the existing one, somehow, and reuse it here.
> But I suppose it would've been too difficult to do :(

Indeed, I dislike this as well, and it was the original reason I did not feel
like trying to implement retry in parallel replication (before I found I
needed it to handle deadlocks).

Retry happens very rarely, and a completely different code path for reading
binlog in this case is very prone to introducing new bugs due to insufficient
testing.

But yes, I also did not find a good way to avoid it :-(

At least it turned out to be less code needed than I had feared initially. And
most of it is already abstracted in called functions like read_log_event() and
open_binlog().

Thanks,

 - Kristian.
Follow ups

Re: Review request for MDEV-6020, MDEV-5914, MDEV-5941, and MDEV-5262
From: Sergei Golubchik, 2014-07-04
References

Review request for MDEV-6020, MDEV-5914, MDEV-5941, and MDEV-5262
From: Kristian Nielsen, 2014-06-10
Re: Review request for MDEV-6020, MDEV-5914, MDEV-5941, and MDEV-5262
From: Sergei Golubchik, 2014-07-03