maria-developers team mailing list archive

Thread
Date

Re: Deadlock with STOP SLAVE and LOCK_active_mi

To: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
From: Michael Widenius <michael.widenius@xxxxxxxxx>
Date: Sun, 24 Apr 2016 14:04:48 +0300
Cc: MariaDB Developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <1542e2ffae8.2826.04097479319f7dbfc4cc130de0ac6822@mail.askmonty.org>

Hi!

>>> Any suggestions for how this is supposed to work? Or is it just broken
>>> by design, but saved because normally slave threads do not need to
>>> access SHOW STATUS or system variables?

<cut>

> Hm. So I went through all of the code that references LOCK_active_mi to try
> and understand this. It seems there are two main uses:
>
> 1. As you say, to serialise admin commands. I think the list is: end_slave()
> (server shutdown); CHANGE MASTER; START/STOP SLAVE; START/STOP ALL SLAVES;
> RESET SLAVE; FLUSH SLAVE.
>
> 2. To protect access to master_info_index, eg. to prevent a Master_info from
> disappearing while it is accessed.

When reading about this, I think it would be better to have a counter
in Master_info if someone is using it and only delete if if the
counter is zero. Would be trivial to add an increment of the counter
in get_master_info().

The counter could either be protected by the LOCK_active_mi or one of
the mutexes in Master_info like data_lock.

> There is a comment in rpl_rli.h that LOCK_active_mi protects
> Relay_log_info::inited, but I did not understand it - maybe it is wrong?

Setting inited is indirectly protected by LOCK_active_mi, as
init_all_master_info() which calls init_master_info(), is protected by
LOCK_active_mi.

Looking at the code, I didn't however see any need to protect inited
as this is an internal
flag that is always 1 after start. It's main usage is to avoid some
re-initialization of relay logs on retries.

> So the idea would be to make a new lock for (1), and keep LOCK_active_mi for
> (2).
>
> Actually, using a lock for (1) seems a bad idea - STOP SLAVE can take a long
> time to run, and a mutex makes it unkillable. A condition variable might be
> better (to make STOP SLAVE killable). But I guess that is a separate issue,
> it would not affect the possibility of deadlocks.

Agree that a condition variable would be better, but not critical as
one has already a problem if stop_slave doesn't work.

> Both LOCK_active_mi and the new LOCK_serialize_replication_admin_commands
> must by themselves prevent master_info_index from having elements added or
> removed. This is needed for start_all_slaves() and stop_all_slaves() to work
> correctly, I think.

Why both?
Isn't it enough to first take LOCK_serialize_replication_admin_commands and
then take LOCK_active_mi if one needs to access master_info_index ?

> Something will need to be done for remove_master_info(). Currently, it also
> tries to stop slave threads (from within free_key_master_info()) while
> holding LOCK_active_mi. It seems that threads should be stopped before
> attempting to remove their mi from master_info_index, probably?

To do that we would need to:
- Add a flag in Master_info that it's not usable anymore and change
this flag only under LOCK_active_mi.  get_master_info() could wait (on
a condition) if this flag is set.
- Add a counter that Master_info is in use.
- Add a function to release Master_info.
- Call terminate_slave_threads() outside of free_key_master_info()

> Actually, there would still be the deadlock problem after introducing the
> new lock, because it is possible for a slave thread to run CHANGE MASTER!
> (and I think START SLAVE / RESET SLAVE as well). But that is probably a bug,
> does not seem good that this is possible. I guess these need a check to
> throw an ER_LOCK_OR_ACTIVE_TRANSACTION error, like STOP SLAVE.

Wouldn't the new lock LOCK_serialize_replication_admin_commands fix that?

> I think I also found another lock problem while reading the code, for
> MASTER_POS_WAIT() og SHOW BINLOG EVENTS. They both grab a Master_info under
> LOCK_active_mi, but then proceed to access it after releasing the lock. I do
> not see anything that prevents the Master_info to disappear under their feet
> if RESET SLAVE ALL is run in another thread? But again, that's a separate
> issue, I suppose. Probably there are some other tricky deadlock issues
> lurking here.

I checked the function mysql_show_binlog_events() but could not find
any access to
Master_info after mysql_mutex_unlock(&LOCK_active_mi) was released.

> I don't know. It seems splitting into a new
> LOCK_serialize_replication_admin_commands could solve the deadlock issue,
> maybe, if the other problems mentioned above are addressed. It seems very
> hard to convince oneself that this would not introduce new problems
> somewhere, the locking in replication is really complex and feels very
> fragile :-/ It doesn't really feel like something one would want to do in
> 10.1 (certainly not 10.0), but maybe 10.2?
>
> It's kind of a very edge-case problem (and seems to have been there for a
> _long_ time). Still, hanging the mysqld process hard, requiring kill -9, is
> also not nice...

Adding two new flags, one if master_info is in use and one if it's
unusable, shouldn't be
that hard to make reasonable safe in 10.1

I am now back in Finland and working. Feel free to call me to discuss
this any time.

Regards,
Monty

Follow ups

Re: Deadlock with STOP SLAVE and LOCK_active_mi
From: Kristian Nielsen, 2016-04-25

References

Deadlock with STOP SLAVE and LOCK_active_mi
From: Kristian Nielsen, 2016-04-15
Re: Deadlock with STOP SLAVE and LOCK_active_mi
From: Sergei Golubchik, 2016-04-15
Re: Deadlock with STOP SLAVE and LOCK_active_mi
From: Kristian Nielsen, 2016-04-18