maria-developers team mailing list archive

Thread
Date
Re: MDEV-7502 Provisioning of slave - review

To: Martin Kaluznik <martin.kaluznik@xxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 24 Aug 2015 16:14:59 +0200
Cc: MariaDB Developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAMMGaZ7gbnuRrETJh7G_zDKebU5-KSe6Gn50E7GaqFnjqSSZiQ@mail.gmail.com> (Martin Kaluznik's message of "Thu, 13 Aug 2015 07:53:56 +0200")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux)
Martin Kaluznik <martin.kaluznik@xxxxxxxxx> writes:

> good to do now. And I would also like to ask you for review of
> existing code.
>
> Github link: https://github.com/f4rnham/server/tree/MDEV-7502
> Diff: https://github.com/MariaDB/server/compare/10.1...f4rnham:MDEV-7502.diff

Please find my review inline in patch, below.

General stuff:

> I have created new log event as way to signalize slave successful end
> of provisioning, it is last event IO thread writes to relay log before
> termination, and last event SQL thread reads from relay log. Test
> cases are using Slave_SQL_Running value from SHOW SLAVE STATUS command
> to wait until provisioning is completed. It works there, but I am not
> sure about its convinience in real world usage.

Hm, I am not sure either.
Another option would be for LOAD DATA FROM MASTER to block until
provisioning is done (failed or successful).


Other than review, please check Buildbot results carefully for any
failures. Pay particular attention to the valgrind build (for memory
leaks). Also pay special attention to random failures in your new test cases
that happen only occasionally. Such sporadic / random failures often turn up
in replication tests.


You need in test cases that use debug_dbug: --source include/have_debug.inc


What is the largest data set you tested with?
Please do a (manual) test with a large data set.

For example, you could run sysbench with a 10GB or 50GB database, let it run
for say 10 minutes, and 5 minutes in you start a LOAD DATA FROM MASTER on a
slave off the server running the sysbench.


Please check comments below on how (and if) the whole approach with
provisioning in parallel with active transactions is correct and makes the
slave always identical with the master at the end.


Is there any issue with foreign keys?

An insert to one table could fail if a foreign key constraint against
another table is not yet satisfied due to rows not yet provisioned. A CREATE
TABLE referencing another table might fail.

Maybe look at if mysqldump is doing something to avoid these kinds of
issues.

A minimal solution is to investigate possible problems, and document them as
limitations. Eg. it is reasonable to document that users must disable
foreign keys during provisioning, if necessary.


What happens if the server that does LOAD DATA FROM MASTER provisioning has
--log-bin and --log-slave-updates? What happens if a third-level slave is
connected to the server that does LOAD DATA FROM MASTER? I think this needs
to be tested.


Comments on the patch:

> +BOOLEAN _db_keyword_locked_(CODE_STATE *cs, const char *keyword, int strict)

Can you explain the race that results without this locking? Is it because
_db_keyword_() is scanning the list without any locking, so if the list
changes during the scan we can crash?


> +#define DBUG_EXECUTE_LOCKED(keyword,a1) \
> +        do {if (_db_keyword_locked_(0, (keyword), 0)) { a1 }} while(0)

You need empty definition for non-debug build (but I guess you already saw
in buildbot run).


> +--let $slave_param= Slave_SQL_Running
> +--let $slave_param_value= No
> +--let $slave_error_param= Last_SQL_Errno
> +--source include/wait_for_slave_param.inc

Is there a reason not just to use include/wait_for_slave_sql_to_stop.inc
here?



> +# Ensure that master and slave log names and positions are equal
> +--connection master
> +--let $binlog_file_master= query_get_value(SHOW MASTER STATUS, File, 1)
> +--let $binlog_pos_master= query_get_value(SHOW MASTER STATUS, Position, 1)
> +--connection slave
> +--let $binlog_file_slave= query_get_value(SHOW SLAVE STATUS, Master_Log_File, 1)
> +--let $binlog_pos_slave= query_get_value(SHOW SLAVE STATUS, Read_Master_Log_Pos, 1)

Is there a reason you do not use GTIDs here instead (or in addition)?


> +# echo Binlog file name master - $binlog_file_master;
> +# echo Binlog file name slave - $binlog_file_slave;
> +# echo Binlog position master - $binlog_pos_master;
> +# echo Binlog position slave - $binlog_pos_slave;

Please remove this left-over debugging code.


> +++ b/mysql-test/suite/rpl/t/rpl_provisioning_base.inc

> +eval CREATE TABLE test.test_1 (a int, PRIMARY KEY (a)) ENGINE=$engine_type;
> +eval CREATE TABLE test.test_2 (a int, b int, c int, PRIMARY KEY (a)) ENGINE=$engine_type;

Try adding also some foreign key contraints to the table for the test.

> +# Start provisioning
> +--connection slave
> +LOAD DATA FROM MASTER;


At this point, master has full binlogs. So just START SLAVE would do the
same thing, just by re-playing the binlog.

I think the test should use FLUSH LOGS on the master a couple of times
during the initial setup of data, and then PURGE BINARY LOGS to delete some
(but not all) of that binlog data, to test that the slave can still
provision without complete binary logs.

> +SET GLOBAL debug_dbug="+d,provisioning_wait";

> +SET GLOBAL debug_dbug="-d,provisioning_wait";

Earlier, when I did this, I ended up with tons of debug output in the error
log after setting the "-d,..." which is annoying. Setting an empty "+d"
option was enabling debug trace. Maybe this is now fixed, if so then all the
better. But if you get this problem (with stray debug trace in error log),
you can use the following pattern instead:

SET @old_dbug= @@GLOBAL.debug_dbug;
...
SET GLOBAL debug_dbug= @old_dbug;


> +#
> +# Attempt to start provisioning while not in GTID mode
> +#

Is this an old comment? It seems none of the tests really use any GTID.

Does provisioning work in GTID mode? It should be tested.


> +#
> +# Attempt to start provisioning while SQL thread is running
> +#

What if IO thread is running? Should probably be tested also.


> +# Creation of database `test` depends purely on timing and it has nothing to do with this test case
> +--disable_warnings
> +DROP DATABASE IF EXISTS test;


What is the purpose of this?
It sounds like there is a potential race in the test case? Maybe there is a
better way to handle it.


> +START SLAVE;
> +
> +# Cleanup after test case
> +--connection slave
> +STOP SLAVE;
> +RESET SLAVE;
> +# Creation of database `test` depends purely on timing and it has nothing to do with this test case

It is better to use the include/start_slave.inc and include/stop_slave.inc.
They prevent some races, which might solve some of the problems that your
comment about timing seems to hint at. Also, it is usually best to
synchronise the slave with the master before stopping, to avoid it being
stopped at a random point.


> +#
> +# Bug: Deadlock occurs when IO thread fails with error before dump request
> +#

What do you mean? That the current patch has a bug triggered by this test
case? Or that there was a bug, fixed now, and this is a test for it? Please
explain better in the comment.


> diff --git a/mysql-test/suite/rpl/t/rpl_provisioning_stress.test b/mysql-test/suite/rpl/t/rpl_provisioning_stress.test
> new file mode 100644
> index 0000000..8141015
> --- /dev/null
> +++ b/mysql-test/suite/rpl/t/rpl_provisioning_stress.test
> @@ -0,0 +1,48 @@
> +--skip Test requires manual setup
> +
> +# Stress test for MDEV-7502 Automatic provisioning of slave
> +# Using Random Query Generator v2.2.0 release, downloadable from https://launchpad.net/randgen/+download

I think it is great that you made this kind of test.

But maybe it would be better to put it somewhere else, to not have a test in
the normal test suite that is always unconditionally skipped. Eg. it could
be put in /tests/, and one would copy it into the test suite, instead of
removing the --skip.


> +void Provisioning_done_log_event::print(FILE* file,

> +  print_header(&cache, print_event_info, FALSE);
> +  my_b_write_string(&cache, "\tProvisioning done\n");

Hm, this can occur in the relay log, but not in the binlog, right?

I think it is ok not to have a test of this in mysql-test-run, we generally
can only test the "short-form" in there to avoid dependency on time and
such. Just be sure to check manually that the output is correct.


> @@ -9718,23 +9760,29 @@ int Rows_log_event::do_apply_event(rpl_group_info *rgi)

> +      // In provisioning mode, silently ignore table not found error
> +      if (!rgi->thd->slave_thread || !rli->mi->provisioning_mode ||
> +          !provisioning_error_code(actual_error))

Better use thd than rgi->thd here (just for consistency).


> +        // Ignore certain errors during provisioning

What thoughts did you have on whether just ignoring these kinds of errors
would always result in the correct result of provisioning?

I suppose since we are copying data fresh from the master, it should not be
possible that we miss a "real" error in this way.

Will things work correctly, if we have multiple rows in a single
Rows_log_event, and some of them succeed while others fail? In both MyISAM
and InnoDB?

Can an argument be made to prove that, if we ignore an insert or update
here, we are sure to later write the correct data as part of provisioning?
For both InnoDB and MyISAM?

Similarly, can it be proven that it is impossible to ignore a delete and
still later wrongly provision the deleted row?

I suppose it would be something with binlog ordering and isolation modes
for the InnoDB case - and some table locking semantics wrt. MyISAM?


> +
> +  /*
> +    Semaphore used by LOAD DATA FROM MASTER command for test cases, its role is
> +    to ensure, that slave's IO thread is connected to master before test case
> +    can continue.

> +  volatile bool dump_requested_semaphore;

Shouldn't this be under #ifndef DBUG_OFF ?


> diff --git a/sql/rpl_provisioning.cc b/sql/rpl_provisioning.cc

> +void quote_name(char const *name, char *buff)

Try to use append_identifier() instead. Or maybe my_snprintf("%`s").
I notice that append_identifier() handles a configuration where "" is used
to quote identifiers instead of ``, you should probably test if this causes
a bug in your code, btw.


> +provisioning_send_info::provisioning_send_info(THD *thd_arg)
> +  : thd(thd_arg)
> +  , connection(new Ed_connection(thd))
> +  , row_batch_end(NULL)
> +  , row_buffer(my_malloc(ROW_BUFFER_DEFAULT_SIZE, MYF(0)))
> +  , row_buffer_size(ROW_BUFFER_DEFAULT_SIZE)
> +  , phase(PROV_PHASE_INIT)
> +  , error(0)
> +  , error_text(NULL)
> +  , event_conversion_cache(NULL)
> +{
> +}

So this constructor will probably cause the server to crash in case of
out-of-memory, right?

Usually, code tries to fail gracefully with ER_OUT_OF_RESOURCES /
ER_OUTOFMEMORY if malloc() fails (though its not well tested, probably).


> +void provisioning_send_info::close_tables()
> +{
> +  trans_commit_stmt(thd);


I see no call to commit the entire transaction (as opposed to just the
statement here). This may be intentional (autocommit?), in any case please
add a comment here explaining how the transaction is committed
wrt. statement-transaction and full transaction.

Maybe add a test where the server is running with auto-commit globally
disabled? Though I am not sure if the short mysql-test-run.pl tests would
notice it if everything was done as one long transaction (though large
production sites undobtedly would ;-).


> +bool provisioning_send_info::send_query_log_event(DYNAMIC_STRING const *query,

Is there a particular reason that you chose DYNAMIC_STRING over the String
class? I think String is more widely used for building queries and such?

A common use pattern with String is to allocate a reasonably-sized buffer on
the stack and initialise a String object from it. Then in the common case of
smallish result, no my_malloc() calls are needed.


> +  if (cs_info)
> +  {
> +    // thd->variables.character_set_client->number
> +    int2store(evt.charset, cs_info->cs_client);
> +    // thd->variables.collation_connection->number
> +    int2store(evt.charset + 2, cs_info->cl_connection);
> +    // cs_info->cl_server
> +    int2store(evt.charset + 4, thd->variables.collation_server->number);
> +    // thd->variables.collation_database->number
> +    evt.charset_database_number= cs_info->cl_db;

Can you explain why you need to consider the charsets/collations of the
connection and client? Could you not just always use the system character
set for generating queries?


> +    if (my_strcasecmp(system_charset_info, "VIEW", type->str) == 0)
> +    {
> +      char name_buff[NAME_LEN * 2 + 3];
> +      DYNAMIC_STRING *str= (DYNAMIC_STRING*)my_malloc(sizeof(DYNAMIC_STRING),
> +                                                      MYF(0));
> +
> +      quote_name(database->str, name_buff);
> +      init_dynamic_string(str, name_buff,
> +                          // Both lengths + dot and quotes, will be reallocated
> +                          // only in special cases
> +                          database->length + name->length + 5,
> +                          5);
> +      dynstr_append(str, ".");
> +      quote_name(name->str, name_buff);
> +      dynstr_append(str, name_buff);
> +
> +      views.push_back(str);
> +    }
> +    else if (my_strcasecmp(system_charset_info, "BASE TABLE", type->str) == 0)
> +      tables.push_back(my_strdup(name->str, MYF(0)));

Why do you need to prepend the database name in VIEW case but not in TABLE
case? Please add a comment explaining that.

> +    else
> +      DBUG_ASSERT(false &&
> +                  "Unexpected type of table returned by 'SHOW FULL TABLES'");

Probably you should add an error return here, that will cause provisioning
to fail in the non-debug build, rather than silently press on.


> +bool provisioning_send_info::event_to_packet(Log_event &evt, String &packet)
> +{
> +  // Reset cache for writing
> +  reinit_io_cache(event_conversion_cache, WRITE_CACHE, 0, false, true);
> +
> +  if (evt.write(event_conversion_cache))
> +  {
> +    error_text= "Failed to write event to conversion cache";
> +    return true;
> +  }
> +
> +  if (reinit_io_cache(event_conversion_cache, READ_CACHE, 0, false, false))
> +  {
> +    error_text= "Failed to switch event conversion cache mode";
> +    return true;
> +  }
> +
> +  packet.set("\0", 1, &my_charset_bin);
> +  if (packet.append(event_conversion_cache,
> +      event_conversion_cache->write_pos - event_conversion_cache->buffer))
> +  {
> +    error_text= "Failed to write event data to packet";
> +    return true;
> +  }

Hm. So first the event is constructed in-memory. Then it is written to an
IO_CACHE. Then it is read back in again into a byte buffer. And then it is
written to the slave's client socket.

Hm, I see, the client socket we have only as a NET, not an IO_CACHE. But the
Log_event subclasses need an IO_CACHE to construct a serialised event.

Hm, that's clunky, but I don't see an easy way to avoid it. But please add a
comment explaining why this is necessary.

Also, doesn't the IO_CACHE spill to disk if the data written becomes large?
So wouldn't a very large event (eg. a row containing a large BLOB) fail
here? Because the IO_CACHE only contains part of the data in-memory? I am
not 100% sure about how this works, please investigate and test with a large
row (if you did not already).


> +  // Set current server as source of event, when slave registers on master,
> +  // it overwrites thd->variables.server_id with its own server id - and it is
> +  // then used when writing event
> +  int4store(&packet[SERVER_ID_OFFSET + 1], global_system_variables.server_id);
> +  return false;

I think this will fail if event checksums are enabled? Or rather, the patch
does not seem to handle binlog checksums at all. Try adding a test that
enables binlog checksums on master and slave, and check that everything
works.

And also - why do you need to change this in the first place? Doesn't the
server_id get set correctly when the event is initially created?


> +bool provisioning_send_info::send_event(Log_event &evt)
> +{
> +  String packet;
> +
> +  // Slave will accept this event even if it modifies data structures
> +  evt.flags|= LOG_EVENT_PROVISIONING_F;
> +
> +  if (event_to_packet(evt, packet))
> +    return true;
> +
> +  if (my_net_write(&thd->net, (uchar*)packet.ptr(), packet.length()))
> +  {
> +    error_text= "Failed to send event packet to slave";
> +    return true;
> +  }

Yeah...
It would be so much more elegant if we could wrap the thd->net in an
IO_CACHE, and write directly to the client.

Did you investigate if this would be a possibility, without too much extra
work?


> +int8 provisioning_send_info::send_table_data()

> +  if ((error= hdl->prepare_range_scan(row_batch_end, NULL)) != 0)

> +  if (table->s->primary_key == MAX_KEY)
> +  {
> +    error_text= "Table does not contain primary key";
> +    close_tables();
> +    return 1;
> +  }

It seems more natural to me to check for missing primary key _before_
prepare_range_scan(). (Otherwise one wonders how it is possible to prepare a
range scan on a table without _any_ key? I suppose it works to do a table
scan instead, but better do the check first to avoid having to wonder).


> +    while (packed_rows < PROV_ROW_BATCH_SIZE &&
> +           (error= hdl->read_range_next()) == 0);

I think there should be two stop criterias. One is max number of rows
reached, another is size of rows.

At least, you need to protect against exceeding --max-packet-size. Otherwise
the slave will be unable to receive the too large packet.

Please add a test case that checks large rows and that it correctly splits
row events to not exceed max-packet-size (of course at least one row must
fit in one Rows_log_event).

Check the code for normal binlogging to see how it splits up Row_log_events
to avoid creating too large an event.

> +  table->default_column_bitmaps();
> +  close_tables();


So the critical issue here is: how can we be sure that we will not miss rows
during provisioning, by racing between provisioning events and normal binlog
events?

This needs to be clearly thought out, and then explained in a comment
somewhere in the code.

For example, you write the events to the slave client socket before closing
the table and releasing locks. This seems critical (and correct) - because
this ensures that if there is a DELETE made just after the provisioning read
of the table, that DELETE will be sent only _after_ sending the provisioning
rows (so we do not lose the DELETE). This should be explained in a comment.

But what about isolation level (for InnoDB)? Don't we need to run with a
sufficiently high isolation level to be correct? Intuitively, I thought so,
but maybe not, at least you have nothing in the patch about isolation level
that I can see.

What about this scenario?

1. Provisioning reads rows 20-29, sends them to slave.

2. An UPDATE runs, it updates row 32 to be now row 25, it is written to the
binary log and sent to the slave.

3. Provisioning reads rows 30-39, sends them to the slave.

On the slave, the UPDATE from 2 should fail - because row 32 does not exist
at this point.

But then, it seems to me that the row 25 will never appear on the slave,
right? So provisioning will be incorrect. Or am I missing something?

I wonder if the solution to this particular case is that during
provisioning, an UPDATE (at least one that changes primary key value) must
be logged as a DELETE (of the old row) + an INSERT (of the new
row). Alternatively, maybe the slave can during provisioning execute the
UPDATE regardless of the missing before image, still writing the after image
(row 25) to the table.

More generally, we really need an explanation of how the provisioning
ensures that the end result on the slave is identical to the master -
otherwise there may be other problems like this hidden.

Also, it would be really good if you could make a test case for the above
potential issue. You can get some ideas from how I use DBUG_EXECUTE_IF
together with DEBUG_SYNC in the rpl_parallel_*.test test cases. But I know
it can be a lot of work to get such tricky race-proviking test cases to work
reliably.


> diff --git a/sql/rpl_provisioning.h b/sql/rpl_provisioning.h

> +// FIXME - Farnham
> +// Header
> +
> +#pragma once

We do not use this #pragma once in mariadb (maybe that is what you intended
to fix with the FIXME, so mentioning just in case).

I assume you will take care of other such FIXME's in the patch yourself
before pushing.


> +/*
> +  How much rows from table will be sent in one provisioning tick
> +  FIXME - Farnham
> +  Dynamic value
> +*/
> +#define PROV_ROW_BATCH_SIZE 5

Right. Probably just a matter of adding something to sql/sys_vars.cc


> -  if (rpl_load_gtid_slave_state(thd))
> +  if (!mi->provisioning_mode && rpl_load_gtid_slave_state(thd))

Actually, what should happen with the GTID position for LOAD DATA FROM
MASTER? Just clearing it probably is not correct, in case user wants to LOAD
DATA FROM MASTER from multiple masters.

Maybe the user should clear it herself, if necessary, and then it will be
updated by the normal replication stream from master.

But why do you skip loading it here for provisioning mode? That does not
seem right. First, it will normally be loaded already anyway. And second, if
not loaded, then just having the wrong value in-memory (as compared to in
the mysql.gtid_slave_pos table) does not seem correct?


> +    DBUG_EXECUTE_IF("provisioning_test_running",
> +                    if (!slave_errno)
> +                      while (!mi->dump_requested_semaphore)
> +                        /* pass */;);

Maybe put a my_sleep(20000) in with the /* pass */.


> +  case PROVISIONING_DONE_EVENT:
> +    mi->abort_slave= true;
> +    /*
> +      Fall through to default case ... and through debug code, nothing wrong
> +      should happen
> +    */

That seems a bit fragile.
Better do like in other places: goto default_action;


> +      // In provisioning mode, master starts sending data from current position
> +      if (flags & BINLOG_DUMP_PROVISIONING_MODE)

So provisioning assumes non-GTID mode?

Maybe that is ok...
But it should still be tested what happens if provisioning is attempted in
GTID mode.

Perhaps either GTID or non-GTID mode can work the same, since provisioning
does not need to start at any particular position - and at the end of
provisioning, both GTID and non-GTID position should be correct.


> +bool
> +Ed_connection::execute_direct(char *sql_text, size_t length)
> +{
> +  LEX_STRING tmp_query;
> +  tmp_query.str= sql_text;
> +  tmp_query.length= length ? length : strlen(sql_text);
> +
> +  return execute_direct(tmp_query);
> +}

You are not actually ever calling this without passing length.
So please make this an inline method in the .h file, with mandatory length
argument, that does not check for zero length or call strlen().


 - Kristian.