← Back to team overview

maria-developers team mailing list archive

Re: Conservative parallel slave is "too optimistic" for certain engines

 

Sergey Petrunia <sergey@xxxxxxxxxxx> writes:

> == Symptoms == 
> When one runs a parallel slave (mode=conservative) and replicates DML for 
> MyRocks table without a Primary Key, replication may stop with a
> ER_KEY_NOT_FOUND error.

> == A detail about conservative replication ==

> The remaining case is
> - storage engine that doesn't support gap locks
> - Row-based-replication of table without primary key
>
> In this case, finding the row to delete is done by scanning the table until we
> find the row where all columns match. The scan does not see the rows that
> the INSERT is about to INSERT, and DELETE stops with an error, "we could not
> find the row to delete".

Nice analysis. Normally it is guaranteed to be safe to start the next
transaction in parallel with the commit step of a prior. But as you
describe, RBR on table with no primary key is special, since it does not
really correspond to any SQL-level operation.

> 2. Change parallel slave to wait *for commit*. This should only be done if
> tables that are updated do not support Gap Locking. This is hard, it will
> require making risky changes on the SQL layer.

I think you can just call thd->wait_for_prior_commit() in the code that does
the RBR table scan on a table with no primary key. Eg.
Rows_log_event::find_row():

  if (m_key_info)
  {
    ...
  }
  else
  {
    DBUG_PRINT("info",("locating record using table scan (rnd_next)"));
    thd->wait_for_prior_commit();
    ...
  }

(Or do it only once, maybe in Rows_log_event::find_key() or similar, and
only if rli->mi->parallel_mode==SLAVE_PARALLEL_CONSERVATIVE).

This makes thread 2 wait for after thread 1 has completed commit, as needed.
Similar code is found in temporary_tables.cc, since temporary tables also
requires a full wait to be safe for parallel replication.

This will theoretically penalise a bit RBR transactions on tables without
primary key in conservative parallel replication. Well, RBR without primary
key has other problems and is strongly discouraged; and I don't know how
useful conservative mode is compared to optimistic.

> 3. Disallow parallel slave execution for tables without Gap Locking and without
> primary key. (Looks most promising but I am not sure if it is sufficient).

I don't think you need to completely disallow parallel replication, just
adding the full wait at the appropriate place should be sufficient.


Another option is to handle the error when it occurs by rolling back and
retrying. This is how it is done in optimistic mode (and I would guess that
this failure will not occur in optimistic mode). See eg.
is_parallel_retry_error() and convert_kill_to_deadlock_error(). In
optimistic parallel replication, this particular issue (a DELETE running
ahead of an INSERT and getting ER_KEY_NOT_FOUND) is normal, and handled by
rolling back the DELETE, doing wait_for_prior_commit(), and trying again.

Conservative mode is already somewhat optimistic, since there are some
corner cases in InnoDB that can cause conflicts even in conservative mode
and require rollback+retry. So internally, "conservative" and "optimistic"
are quite similar, and allowing ER_KEY_NOT_FOUND (or any normal error) to
trigger a retry is probably an easy change in the code.

Handling the error and retrying could be done without penalising any
non-conflicting cases, though it is perhaps conceptually a somewhat larger
change than just adding the extra wait in conservative RBR with no primary
key.

> I assume this was done to increase the parallelism: initial phases of DELETE
> can be ran in parallel with commit/prepare steps of the INSERT. This is safe to

In fact, all of the DELETE can potentially run in parallel with the commit
step of the INSERT, and they can even group-commit together which is the
important performance gain (but obviously not in cases like this when the
transactions are conflicting).

Hope this helps,

 - Kristian.


Follow ups

References