maria-developers team mailing list archive
Mailing list archive
Re: Deadlock with STOP SLAVE and LOCK_active_mi
>>> Any suggestions for how this is supposed to work? Or is it just broken
>>> by design, but saved because normally slave threads do not need to
>>> access SHOW STATUS or system variables?
> Hm. So I went through all of the code that references LOCK_active_mi to try
> and understand this. It seems there are two main uses:
> 1. As you say, to serialise admin commands. I think the list is: end_slave()
> (server shutdown); CHANGE MASTER; START/STOP SLAVE; START/STOP ALL SLAVES;
> RESET SLAVE; FLUSH SLAVE.
> 2. To protect access to master_info_index, eg. to prevent a Master_info from
> disappearing while it is accessed.
When reading about this, I think it would be better to have a counter
in Master_info if someone is using it and only delete if if the
counter is zero. Would be trivial to add an increment of the counter
The counter could either be protected by the LOCK_active_mi or one of
the mutexes in Master_info like data_lock.
> There is a comment in rpl_rli.h that LOCK_active_mi protects
> Relay_log_info::inited, but I did not understand it - maybe it is wrong?
Setting inited is indirectly protected by LOCK_active_mi, as
init_all_master_info() which calls init_master_info(), is protected by
Looking at the code, I didn't however see any need to protect inited
as this is an internal
flag that is always 1 after start. It's main usage is to avoid some
re-initialization of relay logs on retries.
> So the idea would be to make a new lock for (1), and keep LOCK_active_mi for
> Actually, using a lock for (1) seems a bad idea - STOP SLAVE can take a long
> time to run, and a mutex makes it unkillable. A condition variable might be
> better (to make STOP SLAVE killable). But I guess that is a separate issue,
> it would not affect the possibility of deadlocks.
Agree that a condition variable would be better, but not critical as
one has already a problem if stop_slave doesn't work.
> Both LOCK_active_mi and the new LOCK_serialize_replication_admin_commands
> must by themselves prevent master_info_index from having elements added or
> removed. This is needed for start_all_slaves() and stop_all_slaves() to work
> correctly, I think.
Isn't it enough to first take LOCK_serialize_replication_admin_commands and
then take LOCK_active_mi if one needs to access master_info_index ?
> Something will need to be done for remove_master_info(). Currently, it also
> tries to stop slave threads (from within free_key_master_info()) while
> holding LOCK_active_mi. It seems that threads should be stopped before
> attempting to remove their mi from master_info_index, probably?
To do that we would need to:
- Add a flag in Master_info that it's not usable anymore and change
this flag only under LOCK_active_mi. get_master_info() could wait (on
a condition) if this flag is set.
- Add a counter that Master_info is in use.
- Add a function to release Master_info.
- Call terminate_slave_threads() outside of free_key_master_info()
> Actually, there would still be the deadlock problem after introducing the
> new lock, because it is possible for a slave thread to run CHANGE MASTER!
> (and I think START SLAVE / RESET SLAVE as well). But that is probably a bug,
> does not seem good that this is possible. I guess these need a check to
> throw an ER_LOCK_OR_ACTIVE_TRANSACTION error, like STOP SLAVE.
Wouldn't the new lock LOCK_serialize_replication_admin_commands fix that?
> I think I also found another lock problem while reading the code, for
> MASTER_POS_WAIT() og SHOW BINLOG EVENTS. They both grab a Master_info under
> LOCK_active_mi, but then proceed to access it after releasing the lock. I do
> not see anything that prevents the Master_info to disappear under their feet
> if RESET SLAVE ALL is run in another thread? But again, that's a separate
> issue, I suppose. Probably there are some other tricky deadlock issues
> lurking here.
I checked the function mysql_show_binlog_events() but could not find
any access to
Master_info after mysql_mutex_unlock(&LOCK_active_mi) was released.
> I don't know. It seems splitting into a new
> LOCK_serialize_replication_admin_commands could solve the deadlock issue,
> maybe, if the other problems mentioned above are addressed. It seems very
> hard to convince oneself that this would not introduce new problems
> somewhere, the locking in replication is really complex and feels very
> fragile :-/ It doesn't really feel like something one would want to do in
> 10.1 (certainly not 10.0), but maybe 10.2?
> It's kind of a very edge-case problem (and seems to have been there for a
> _long_ time). Still, hanging the mysqld process hard, requiring kill -9, is
> also not nice...
Adding two new flags, one if master_info is in use and one if it's
unusable, shouldn't be
that hard to make reasonable safe in 10.1
I am now back in Finland and working. Feel free to call me to discuss
this any time.