
maria-developers team mailing list archive

Re: comments to the plugin.h

 

Sergei Golubchik <serg@xxxxxxxxxxx> writes:

> I'm kind of lost here. I really don't like requiring engines to do more
> obligatory things, but I don't see a good way around it either.

Agreed - that's how I felt while trying to think of _any_ way to solve these
problems. I ended up with the current solution, but I'm not entirely happy
with it.

At least, we have some more ideas now to improve things.

What should be the next step?

I suggest that I could change the API to use the new thd_get_commit_order()
instead (see below for detailed suggestion). And try to implement the method
where InnoDB gets the information about commit order and does the kill of the
other transaction itself.

And I'll read up on services and use it for the new API.

And then I could prepare a new patch for you to look at.

Does this sound reasonable? Or do you have another suggestion?

More detailed discussions below:


> I thought that if a transaction is rolled back, you can, technically
> force *all* later transactions to roll back too and redo them all.
> But that'd be kind of expensive.

It's not the expense that worries me. The problem is that some of the
following transactions may be impossible to roll back.

Suppose we have T1, T2, T3. T1 and T2 are InnoDB DML, T3 is DDL or MyISAM.

If we now get a deadlock between T1 and T2 and kill T3 to resolve it, we can
seriously corrupt replication, because we will be left with T3 half-way
executed. This is made worse by the fact that there might not actually have
been any deadlock - it could just be a too-low lock wait timeout.

This is the reason I did the fix as I did. In my patch, we will kill and retry
only T2, so T3 will be unaffected.

> If we could know whether an engine uses this new API, we could - like I
> wrote above - default to rolling back all transactions in a group after
> the failed one. And then the new API would've been merely an
> optimization. *That* would've been a big improvement.

Suppose that, with the current patch, the engine does not support the new
API. T1 will get a lock timeout. It will retry, but get the lock timeout
again, and after slave_transaction_retries it will give up, and replication
will stop.

So it seems we have two alternative default behaviours (I mean, if the
storage engine does not implement the new API):

1. As in the current patch, replication will stop, the DBA will be made aware
of the issue, and will have to handle it manually.

2. As you suggest, we will kill T3, slave will continue automatically (with
some lock timeout delays), but we might have silently corrupted the slave
state.

(1) seems safer, while (2) will be a lot more convenient for the DBA in many
cases. Which do you think we should prefer?

I wonder if we could somehow check whether T3 is safe to roll back? If it is,
(2) is both convenient and safe, and could be used. Many users normally use
only InnoDB and rarely use DDL. But I'm not sure how hard it would be to do
this check in the general case.

Maybe it would be worth looking into, since, as you said, it would be much
preferable if the new storage engine API were just an optional optimisation.


>> 2. thd_deadlock_victim_preference() is only used when the _storage engine_
>> detects a deadlock, replication related or not.

>> 3. thd_need_ordering_with() is used by the storage engine to decide on lock
>> wait semantics.
>
> Which is, basically, the same as 2, the concept is the same. One
> function is used when deadlock is detected, the other is earlier, but
> it doesn't make them sufficiently different.

I do not think it is the same; let me try to explain once more, and maybe it
will become clearer one way or the other.

With just (1), thd_report_wait_for() or whatever we replace it with, the basic
problem is solved. In case of a deadlock between T1 and T2, we will kill T2,
let T1 continue, and later re-try T2. This works even without
thd_need_ordering_with().

But in some cases the deadlock, kill, and retry are unnecessary. This is
about the issue in MDEV-5914, for example:

  T1: UPDATE t1 SET secondary=NULL WHERE primary=1
  T2: DELETE FROM t1 WHERE secondary <= 3

If T1 and T2 were group-committed together on the master, we know that it is
safe to run them in parallel, so we can relax the gap lock taken by T2 that
can block T1. Then we avoid having to roll back and retry T2.

But if T2 committed in a different group commit on the master, we do not know
that this relaxing of gap locks is safe. Because T2 may still run in parallel
with the COMMIT of T1, we need to keep the gap locks, so that T2 will wait if
it needs to access any row modified by T1.

So thd_need_ordering_with() really _is_ just an optimisation. We might omit
it completely from the patch, or we could keep it as a possibly useful
optimisation.

One way to use the same function might be:

  thd_get_commit_order(thd, other_thd)
  -2    - thd commits before, and it is not in same group commit as other_thd
  -1    - thd commits before, but it is in same group commit as other_thd
   0    - thd and other_thd can commit in either order
   1    - other_thd commits before, but it is in same group commit as thd
   2    - other_thd commits before, and it is not in same group commit as thd

Is this something like what you want?


>> > I think it'll be much simpler with only one function.
>> Why? I think it's quite the opposite...
>
> Why? Because there's only one function with very simple and clear
> semantics - the server informs the engine what transaction must be
> committed first. The engine uses this information to choose deadlock
> victims, to resolve waits, etc. This is just one high-level function,
> that doesn't use concepts of gap locks, statement-based replication,
> locks that are released before commit, locks that don't cause waits, and
> whatever else one must keep in mind when using lower-level API.
>
> I think it's much simpler to use. It is more complex if you think of
> everything it might mean and do in different situation. Compare a modern
> smartphone and an old punch-card computer. An old computer is simpler
> internally, a smartphone is easier to use.

So I'm fine with whatever way we agree on in the end.

The reason I think my original API is simpler is that there is just one very
simple thing that the engine must do: call thd_report_wait_for(T1,T2) whenever
T1 is about to wait until T2 commits. Everything else is optional
optimisations that can be safely ignored. It seems that any conceivable engine
must already have easy access to knowledge about one transaction needing to
wait for another. And indeed, in InnoDB only a couple of lines need to be
added to handle this. So it is very simple to use.

With your suggestion, however, things are more complex for the engine. It
needs to implement additional deadlock detection, based on some enforced
commit order. The very concept of "enforced commit order" is probably
something new. And it needs to implement some way to resolve deadlocks with
another transaction T2 that may or may not be running inside the engine. So
again, a new concept is introduced: "is transaction T running inside this
engine?"

So I still do not understand why you think the second approach is simpler, or
easier to use.

But I can try to see if I can implement the second idea. The main
complication I see is how to be sure whether T2 is running inside the engine
or not. Maybe like this:

1. Call thd_get_commit_order(T1, T2); we learn that we have a deadlock, so we
need to kill T2.

2. Set a flag in T2 that it has become a deadlock victim.

3. Call the new thd_plugin_awake(), it will kill T2 if it is outside of the
engine.

4. It could be that T2 was executing outside the engine at point (2), but had
time to enter the engine while we were traversing the list of handlertons in
(3). But then we can add code at appropriate places in the engine to check for
the flag set in (2), and make sure T2 gets killed also in this case.

(Maybe the necessary logic is already present in innobase_kill_query(); I
will check.)

Have I understood your suggestions correctly now, do you think?


> So, either one needs two functions (thd_deadlock_victim_preference and
> thd_need_ordering_with) or modified_non_trans_table should be moved out
> of it where it was. I'd prefer two functions, it was good that you put
> modified_non_trans_table inside thd_deadlock_victim_preference.

Ok, right, modified_non_trans_table is unrelated, but it felt natural to move
it out of the storage engine and into a more general function.


> This is O(N**2) function calls that immediately return; it's
> negligible compared to the total cost of graph traversal. And anyway, it's only
> called before waits - the engine is going to wait anyway, one function
> call won't make this wait any noticeably longer.

You are assuming that the engine will need to do this O(N**2) graph traversal
anyway. That's true in InnoDB, which has a very primitive algorithm for
deadlock detection. But it might not be true in another engine.

The idea with thd_need_wait_for() was that it would be used by a storage
engine that would not normally need to loop over all other transactions. So it
could avoid needlessly implementing such a loop for _all_ transactions, where
it would only be needed for replication transactions (which will usually have
few waits, if any).

But let's omit it for now; we do not know of any engine that might need it,
and if we someday find one, we can easily add something at that time.


>> One way would be to document that hton->kill_query must be safe to
>> call from within thd_report_wait_for(). Maybe that is the simplest
>> way. It should not be hard to fix the InnoDB code to make this work.
>> For example, InnoDB could just set a flag in the transaction that it
>> already has lock_sys->mutex held, thus avoiding the recursive mutex
>> lock.
>
> Yes, but it'd make a pretty difficult-to-document API, I'd rather avoid
> it, if possible.

Agreed, it's not nice; I'd also like to avoid it.


Thanks,

 - Kristian.

