
maria-developers team mailing list archive

Problem with debug_sync, THD::enter_cond(), and kill ...

 

I got a crash in my parallel replication test case, and this one turned out to
be caused by a fundamental problem in debug_sync. I am not sure how to solve
it, so I wanted to explain the issue in case others have some ideas.

The crash happens in THD::awake() when a thread is killed:

    mysql_mutex_lock(&mysys_var->mutex);
    ...
        int ret= mysql_mutex_trylock(mysys_var->current_mutex);
        mysql_cond_broadcast(mysys_var->current_cond);
        if (!ret)
          mysql_mutex_unlock(mysys_var->current_mutex);

The problem is that mysys_var->current_mutex changed between the trylock() and
the unlock(), so we unlock a different mutex than the one we locked - ouch!
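To make the hazard concrete, here is a minimal sketch of the waking side, using
std::mutex rather than the server's mysql_mutex_* wrappers. WaitSlot and
wake_waiter() are hypothetical names for illustration, not server code. Each
pointer is read once into a local, so the unlock always targets the mutex that
was actually locked. On its own this only removes the mismatched unlock; it
does not make an unsynchronized writer of the pointers safe.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the fields of mysys_var used in the excerpt above.
struct WaitSlot {
  std::mutex mutex;                               // taken by the waking side
  std::mutex *current_mutex = nullptr;            // mutex the thread waits under
  std::condition_variable *current_cond = nullptr;
};

// Wake a waiter, reading each published pointer exactly once.
// Returns true if the waiter's mutex could be taken for the broadcast.
bool wake_waiter(WaitSlot &slot) {
  std::lock_guard<std::mutex> guard(slot.mutex);
  std::mutex *m = slot.current_mutex;             // snapshot, not re-read later
  std::condition_variable *c = slot.current_cond;
  if (m == nullptr || c == nullptr)
    return false;                                 // nothing published
  bool locked = m->try_lock();
  c->notify_all();
  if (locked)
    m->unlock();                                  // same object we locked above
  return locked;
}
```

Even if slot.current_mutex is reassigned between the try_lock() and the
unlock(), the local m still names the mutex that was locked.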

The mutex changed because of this code in debug_sync.cc:

      if (thd->mysys_var)
      {
        old_mutex= thd->mysys_var->current_mutex;
        old_cond= thd->mysys_var->current_cond;
        thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex;
        thd->mysys_var->current_cond= &debug_sync_global.ds_cond;
      }

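There is no mutex protection here: the assignments are made without holding
mysys_var->mutex, which THD::awake() takes before it reads current_mutex.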
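One possible direction, sketched with std::mutex, is to swap the published pair
only while holding the per-thread mutex that the waking side also takes.
ThreadVar, SavedWait and publish_wait() are hypothetical names, and whether
this fits the server's lock ordering is an assumption I have not checked.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for thd->mysys_var; only the fields used here.
struct ThreadVar {
  std::mutex mutex;                               // guards the two pointers
  std::mutex *current_mutex = nullptr;
  std::condition_variable *current_cond = nullptr;
};

// The previously published pair, so the caller can restore it later.
struct SavedWait {
  std::mutex *mutex;
  std::condition_variable *cond;
};

// Replace the published pair while holding var.mutex; return the old pair.
SavedWait publish_wait(ThreadVar &var, std::mutex *m,
                       std::condition_variable *c) {
  std::lock_guard<std::mutex> guard(var.mutex);
  SavedWait old{var.current_mutex, var.current_cond};
  var.current_mutex = m;
  var.current_cond = c;
  return old;
}
```

A caller would keep the returned SavedWait and pass it back to publish_wait()
once its own wait is over, mirroring the old_mutex/old_cond pair in the excerpt.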

So this means that it is not safe to use debug_sync inside
enter_cond()/exit_cond() if kill will be used on the thread at that point.
Since I have a lot of tests for parallel replication where I test exactly for
correct error handling when killing threads at various points in the parallel
processing, it is not too surprising that I would eventually be hit by this :)

I am going to avoid using debug_sync inside enter_cond()/exit_cond() for now.
But any ideas for how to solve this properly? This is particularly nasty to be
hit by, as it is not at all obvious that this use of debug_sync should be a
problem. And it is not exactly easy to guess from the failure what the real
problem is, assuming one can even reproduce the failure: the window of
opportunity in the race is, after all, quite small.

 - Kristian.

