← Back to team overview

maria-developers team mailing list archive

Re: comments to the plugin.h


Hi, Kristian!

On Jul 01, Kristian Nielsen wrote:
> >> +  The storage engine can report such conflicting locks using this
> >> call. This +  will allow parallel replication to detect such
> >> conflicts and resolve the +  deadlock (by killing the second
> >> transaction to release the locks that the +  first is waiting for,
> >> and then later re-try the second killed transaction).
> >
> > What will happen if the engine doesn't? Deadlocks will timeout.
> > Which is annoying, but it only affects the performance, not the
> > correctness.
> No, unfortunately that is not so.
> Let's say we have T1 and T2. T1 goes to wait on a row lock held by T2.
> Then T2 completes its prepare phase, it goes to the group commit queue
> and waits for T1 to complete and commit for both of them. T1 gets lock
> wait timeout, so T1 will roll back and try again. But T1 will just go
> to wait on the same row lock again that is held by T2. So T1 will
> retry over and over until slave_transaction_retries is exceeded, at
> which point replication will abort with an error.
> There is no lock wait timeout when T2 waits in the group commit queue.


I thought that if a transaction is rolled back, you can, technically
force *all* later transactions to roll back too and redo them all.
But that'd be kind of expensive.

On the other hand, requiring *every* engine to report waits... Not good
either. And you cannot really enforce that, there's no way to know
whether an engine does it (compare with handler methods - a required
method can be simply declared abstract and the engine cannot possibly
miss it).

I'm kind of lost here. I really don't like requiring engines to do more
obligatory things, but I don't see a good way around it either.

If we could know whether an engine uses this new API, we could - like I
wrote above - default to rolling back all transactions in a group after
the failed one. And then the new API would've been merely an
optimization. *That* would've been a big improvement.

Can we deduce this somehow, without requiring the engine to declare a

> >> +void thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
> >
> > You kill the second thread in a background.
> > What if you convey the ordering to the engine and let it know about a
> > deadlock? Like, if you return 1 from thd_report_wait_for()
> > it means a deadlock and the engine will kill the transaction
> > just normally as for any normal deadlock case.
> The problem is that T2 could be executing anywhere in the server,
> including server code or another storage engine, where InnoDB does not
> really have any possibility to kill it.
> This is different from a normal InnoDB deadlock, which only happens if
> both T1 and T2 are doing an actual row lock wait inside the storage
> engine itself. But in this case, only T1 is waiting, T2 is continuing
> whatever processing it is doing, eventually it will end up in the
> group commit queue in binlog or in wait_for_prior_commit(), and wait
> for T1 to commit first.

I don't know whether you've seen what I suggested on irc, so I'll
repeat. Provide a thd_kill(thd) helper that an engine can call to kill
a specific thd. It will *not* call ha_kill_query for this particular
engine, but will call for other engines. The logic is - the engine knows
how to kill a thread that's stuck somewhere in this very engine. But it
needs help to kill a thread that's somewhere else. Also it'll help to
avoid mutex issues that you were seeing.

> >> +int thd_need_wait_for(const MYSQL_THD thd);
> >
> > I'd suggest to remove it. If you think it's useful, you can call it
> > from your thd_report_wait_for() and spare storage engine the
> > knowledge of yet another API function.
> See discussion above. The intention is to avoid lots of calls to
> thd_report_wait_for() from the storage engine for non-replication
> threads. So calling it from inside thd_report_wait_for() makes no
> sense.

If it's on the very top of thd_report_wait_for() then it'll be one
function that immediately returns.

> I don't understand the benefits of reducing the functionality
> available in the API when that functionality is completely optional?

The benefit is the keeping API as simple as possible, while still fully
functional. It's very important to have the simple API, even if it's
doing something complex underneath. See pluggable authentication, for
example, the big comment inside native_password_authenticate() function.

> I included this function because without it, there seems no way for a
> storage engine to avoid introducing O(N**2) overhead for N
> transactions contending for a row lock. That seemed unacceptable to
> me.

This is O(N**2) of function calls that immediately return, it's
neglectable on the total cost of graph traversal. And anyway, it's only
called before waits - the engine is going to wait anyway, one function
call won't make this wait any noticeably longer.

> >> +int thd_deadlock_victim_preference(const MYSQL_THD thd1, const MYSQL_THD thd2);
> >
> > Okay, I see why this one is needed.
> >
> > On the other hand...
> > It seems like it's the only function that you realy need.
> > Because
> >    thd_report_wait_for(const MYSQL_THD thd, MYSQL_THD other_thd);
> > can be replaced with
> >   if (thd_deadlock_victim_preference(thd, other_thd) < 0) goto deadlock;
> > And
> >    thd_need_wait_for(const MYSQL_THD thd);
> > can be removed (replaced with noop). And
> >    thd_need_ordering_with(const MYSQL_THD thd, const MYSQL_THD other_thd);
> > can be replaced with
> >   thd_deadlock_victim_preference(thd, other_thd) != 0
> I don't understand the purpose of using the same function for these?
> They do completely different things, using a single function will only
> cause confusion, it seems to me:
> 1. thd_report_wait_for() is used to allow the _replication code_ to
> detect a deadlock related to ordering.

Yes, but my suggestion was to remove it, remember? To let the engne kill
the other thread.

> 2. thd_deadlock_victim_preference() is only used when the _storage engine_
> detects a deadlock, replication related or not.

Right. But my suggestion was to make it replication related. I didn't
know you've put modified_non_trans_table in there too.

I agree that modified_non_trans_table affects deadlock victim
preferences, but does not affect commit order.

So, either one needs two functions (thd_deadlock_victim_preference and
thd_need_ordering_with) or modified_non_trans_table should be moved out
of it where it was. I'd prefer two functions, it was good that you put
modified_non_trans_table inside thd_deadlock_victim_preference.

> 3. thd_need_ordering_with() is used by the storage engine to decide on lock
> wait semantics.

Which is, basically, the same as 2, the concept is the same. One
function is used when deadlock is detected, the other is earlier, but
it doesn't make them suficiently different.

> > Although in that case thd_deadlock_victim_preference() should better be
> > renamed, because it won't be a generic "victim preference",
> > but specifically victim preference, depending on the fixed transaction
> > commit ordering. A better name could be, for example,
> >
> > thd_get_commit_order(thd, other_thd)
> >  -1    - thd first
> >   0    - unspecified
> >  +1    - other_thd first
> How can the storage engine know the difference between T1 and T2 being
> in the same group commit, or in different group commits?

Why the engine needs to know that?

> How can the storage engine inquire about need for gap locks, without
> accidentally reporting a lock wait to the upper layer?

thd_get_commit_order() does not report locks waits to the upper layer,
that was the point, see above - no reporting, the engine does the

> > I think it'll be much simpler with only one function.
> Why? I think it's quite the opposite...

Why? Because there's only one function with very simple and clear
semantics - the server informs the engine what transaction must be
committed first. The engine uses this information to choose deadlock
victims, to resolve waits, etc. This is just one high-level function,
that doesn't use concepts of gap locks, statement-based replication,
locks that are released before commit, locks that don't cause waits, and
whatever else one must keep in mind when using lower-level API.

I think it's much simpler to use. It is more complex if you think of
everything it might mean and do in different situation. Compare a modern
smartphone and an old punch-card computer. An old computer is simpler
internally, a smartphone is easier to use.

> Hm.
> So my idea was to make the new API simple and generic, the upper layer
> would handle most of the detail about commit ordering or whatever. The
> storage engine only needs to think about simple stuff like lock waits
> or deadlock victims, both of which it likely already needs to deal
> with anyway.
> But if I understand correctly, another idea could be to make the new
> API specific just to this particular problem of enforced commit order
> in parallel replication. So the storage engine would be able to know
> of such ordering, and should then handle any necessary actions on this
> by itself?

Eh, not quite. The idea was to make the new API simple and generic *and*
high-level, where the engine would be able know about the enforced
commit order and act accordingly, instead of giving her lots of
low-level orders about what it should do in every specific internal

> > And you won't need background killing for slave threads
> The need for background killing comes because it does not work to call
> THD::awake() from the context of where the InnoDB deadlock detector
> calls thd_report_wait_for(). The problem is that innobase_kill_query()
> does a recursive mutex lock on lock_sys->mutex.
> We discussed on IRC to fix this so background kill is not necessary.
> One way would be to document that hton->kill_query must be safe to
> call from within thd_report_wait_for(). Maybe that is the simplest
> way. It should not be hard to fix the InnoDB code to make this work.
> For example, InnoDB could just set a flag in the transaction that it
> already has lock_sys->mutex held, thus avoiding the recursive mutex
> lock.

Yes, but it'd make a pretty difficult-to-document API, I'd rather avoid
it, if possible.

> You also had a different suggestion:
> > here's another idea. To have thd_get_commit_order() as I thought
> > originally *and* allow innodb to kill other threads, that is, add a
> > function for plugins to invoke THD::awake. This function should not
> > call into the engine that invoked it. That is, supposedly the engine
> > knows how to kill threads stuck in that engine, to kill other
> > threads it should use this new helper.
> > In that case innodb can detect the deadlock and can kill the other
> > thread and it won't break on lock mutex
> > The underlying logic of this killer helper would be "an engine knows
> > how to kill a transaction waiting in this engine, so we wouldn't
> > bother doing that. but to kill a transaction waiting elsewhere the
> > engine needs help"
> It seems better to me if all the details of enforcing commit order and
> killing worker threads in case of deadlocks could be kept inside the
> replication code?  Otherwise every storage engine would potentially
> need to re-implement this?  I think that is what I tried to do with my
> proposed storage engine API, to cleanly separate the storage engine's
> responsibilities, which is about row-level locks and internal deadlock
> victims, from the replication codes responsibilities.

But the engine already has all that code. Wait-for graph shows what
transaction needs to wait for for, and the suggested function simply
helps the engine to resolve waits, by specifying what transaction can
*not* wait for. Then it goes to the normal deadlock resolution in the

> Wouldn't it be simpler to do it like this?
> 1. InnoDB deadlock detector reports lock waits to whatever function
> serves the thd_report_wait_for() purpose.
> 2. We document that thd_report_wait_for() may need to call
> hton->kill_query().
> 3. We fix InnoDB to handle (2) correctly (assuming this is possible to
> do in a simple way).

Then you'd need to document the exact semantics of
thd_report_wait_for(), with all its "should not report false positives,
such as ... and ...". And document that thd_report_wait_for() might call
ha_kill_query, so one would always need to keep this mutex semantics in
mind, because errors won't be easy to find, it's not a use case that
happens often.

Fixing InnoDB, on the other hand, is pretty easy. If we wouldn't need to
think of the API part, but only do an internal fix, I'd use this

> > And anyway, this absolutely must be in a service, not in the global
> > plugin.h.
> Why? My understanding was that we do not support ABI stability for
> storage engines.

Because anything you put in the global plugin.h is not for storage
engines only. It's for every and any plugin out there - up for grabs.
The server has no idea whether any plugin uses that or not.

If we would have a separate header for storage engine API functions
only, that would've not been an issue.

> > See? In this patch you've removed a function from the global plugin
> > API
> Right, the thd_rpl_is_parallel(). That should have never be allowed
> into the code in the first place, it was a temporary hack only. :-(

Sure. I mean that formally removing a function from the API, creates a
new API incompatible with the old one. Which, again, formally, should
invalidate *all* plugins using the old API. If the API is global - it
invalidates all plugins possible.

> > header, strictly speaking, you must increase the API version - the
> > major number of - making all older plugins incompatible with the new
> > server.  This is a bit extreme, taking into account that no plugin
> > actually uses thd_rpl_is_parallel()
> > To avoid this, please only add new APIs as services.
> > See libservices/HOWTO
> I can do that, if you want. It just seems a needless complication to
> me...

The point is to move these functions from a global API (that affects all
plugins) to a small API that affect only plugins that explicitly use it.

Services really make it much easier to change APIs in the future,
because these changes don't affect other plugins.

> Are there any performance penalties for using a service over a normal
> function call (for built-in storage engines)? I seem to recall there
> is not?

None whatsoever. For built-in storage engines this is #ifdef-ed and they
get normal function calls.


Follow ups