← Back to team overview

maria-developers team mailing list archive

Re: c4de76aeff8: MDEV-17554 Auto-create new partition for system versioned tables with history partitioned by INTERVAL/LIMIT

 

Hi Sergei,

On Thu, Apr 8, 2021 at 3:30 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
> Hi, Aleksey!
>
> On Apr 07, Aleksey Midenkov wrote:
> > > > +
> > > > +--echo # INSERT, INSERT .. SELECT don't increment partitions
> > >
> > > it's not really "increment", better say "don't auto-create"
> >
> > Actually I like "increment" more. "Auto-create" overcomplicates phrases:
> >
> > --echo # Increment from 3 to 5
> > --echo # Increment from 3 to 6, manual names, LOCK TABLES
> > --echo # Multiple increments in single command
> >
> > Besides "increment" is correct because PARTITIONS number is incremented.
>
> Sure.
> "Increment the number of partitions", this is fine.
> "Auto create partitions" is also fine.
> "Increment partitions" is meaningless.

It is obvious from the context that we are talking about the number,
not partitions themselves. Treat "partitions" as PARTITIONS keyword
and the increment is attached to a number right next to it. That's
quite a sense, isn't it?

>
> > > > +insert into t1 values (1);
> > > > +
> > > > +create or replace table t2 (y int);
> > > > +insert into t2 values (2);
> > > > +
> > > > +insert into t1 select * from t2;
> > > > +insert into t2 select * from t1;
> > >
> > > why do you need a t2 table in this test?
> >
> > t1 is not incremented in case of
> >
> > insert into t2 select * from t1;
>
> add a comment please

Added.

>
> > > > +--echo # Here t2 is incremented too, but not updated
> > > > +set timestamp= unix_timestamp('2000-01-01 03:00:00');
> > > > +update t1, t2 set t1.x= 0 where t1.x< t2.y;
> > > > +--replace_result $default_engine DEFAULT_ENGINE
> > > > +show create table t1;
> > > > +# Multiupdate_prelocking_strategy::handle_end() is processed after table open.
> > > > +# For PS it is possible to skip unneeded auto-creation because the above happens at
> > > > +# prepare stage and auto-creation is done at execute stage.
> > > > +--replace_result $default_engine DEFAULT_ENGINE 'PARTITIONS 4' 'PARTITIONS ok' 'PARTITIONS 5' 'PARTITIONS ok'
> > >
> > > eh... I don't think this is really "ok".
> > > As far as I remember, Multiupdate_prelocking_strategy knows what tables
> > > should be opened for writing and what for reading. Why would a new partition
> > > be created for t2?
> >
> > It knows this after tables are opened. Look at handle_end(),
> > specifically mysql_handle_derived(), handle_derived(),
> > setup_fields_with_no_wrap() and check_fields(). I believe all these
> > calls are required to get proper get_table_map().
> >
> > To get this working properly there must be 2-staged open tables,
> > something like PS does.
>
> You mean, a new partition is created for t2 in normal mode and not
> created in --ps?

Yes.

>
> What if you'll also add `&& table_list->updating` to your condition in
> TABLE::vers_need_hist_part() ?

If `Multiupdate_prelocking_strategy::handle_end()` was not yet
executed how `updating` can help? No, it doesn't help.

>
> > > > diff --git a/mysql-test/suite/versioning/r/partition.result b/mysql-test/suite/versioning/r/partition.result
> > > > index 2a277b1c2ea..8369b40d98c 100644
> > > > --- a/mysql-test/suite/versioning/r/partition.result
> > > > +++ b/mysql-test/suite/versioning/r/partition.result
> > > > @@ -1236,27 +1270,752 @@ t1   CREATE TABLE `t1` (
> > > ...
> > > > +set timestamp= unix_timestamp('2000-01-01 03:00:00');
> > > > +affected rows: 0
> > > > +lock tables t1 write;
> > > > +affected rows: 0
> > > > +execute s;
> > > > +affected rows: 1
> > > > +info: Rows matched: 1  Changed: 1  Inserted: 1  Warnings: 0
> > > > +execute s;
> > > > +affected rows: 1
> > > > +info: Rows matched: 1  Changed: 1  Inserted: 1  Warnings: 0
> > > > +show create table t1;
> > > > +Table        Create Table
> > > > +t1   CREATE TABLE `t1` (
> > > > +  `x` int(11) DEFAULT NULL
> > > > +) ENGINE=DEFAULT_ENGINE DEFAULT CHARSET=latin1 WITH SYSTEM VERSIONING
> > > > + PARTITION BY SYSTEM_TIME INTERVAL 1 HOUR STARTS TIMESTAMP'2000-01-01 00:00:00' AUTO
> > > > +PARTITIONS 6
> > >
> > > why did it add two partitions here?
> >
> > Because of this condition
> >
> >   /*
> >      When hist_part is almost full LOCK TABLES my overflow the partition as we
> >      can't add new partitions under LOCK TABLES. Reserve one more for this case.
> >   */
> >   if (auto_hist && create_count < 2 &&
> >       thd->lex->sql_command == SQLCOM_LOCK_TABLES &&
> >       vers_info->hist_part->id + 1 == vers_info->now_part->id)
> >     create_count++;
>
> Hmm. I thought the logic is - normally for UPDATE/DELETE you create a
> new history partition when you need to write into it. And under LOCK
> TABLES you create the next partition, just in case, even if the current
> partition is still fine. Which kind of makes sense, at least until
> we'll fix that MDEV about creating partitions under LOCK TABLES.
>
> But this doesn't mean LOCK TABLES should create two partitions, I think.
> If UPDATE/DELETE needs to write into a new partition - you create it,
> LOCK TABLES or not. But I'd rather say that if it's just created, then
> it's not quite "almost full" (it's more like "almost empty", even
> "completely empty"). So there's no need to create a second partition at
> the same time.

Right. Fixed. Actually it was bigger bug: it created 1 partition and
when reopened created 1 more partition.

>
> > > > +
> > > >  --source include/rpl_end.inc
> > > > diff --git a/sql/ha_partition.h b/sql/ha_partition.h
> > > > index 60a2d7f6762..7ade4419c22 100644
> > > > --- a/sql/ha_partition.h
> > > > +++ b/sql/ha_partition.h
> > > > @@ -1610,7 +1610,7 @@ class ha_partition :public handler
> > > >      for (; part_id < part_id_end; ++part_id)
> > > >      {
> > > >        handler *file= m_file[part_id];
> > > > -      DBUG_ASSERT(bitmap_is_set(&(m_part_info->read_partitions), part_id));
> > > > +      bitmap_set_bit(&(m_part_info->read_partitions), part_id);
> > >
> > > Where was it set before your patch?
> >
> > #1 in partition_info::set_partition_bitmaps () at ../src/sql/partition_info.cc:274
> > #2 in ha_partition::change_partitions_to_open () at ../src/sql/ha_partition.cc:8644
> > #3 in set_partitions_as_used () at ../src/sql/sql_base.cc:1562
> > #4 in open_table () at ../src/sql/sql_base.cc:1990
> > #5 in open_and_process_table () at ../src/sql/sql_base.cc:3801
> > #6 in open_tables () at ../src/sql/sql_base.cc:4275
>
> okay. Why is it not happening now?
>

Because it is happening at line 1990 in open_table() and
vers_set_hist_part() is happening earlier.

> > > after such an overflow a table becomes essentially corrupted,
> > > rows are in a wrong partition. So new history partitions cannot be added
> > > anymore without reorganizing the last partition.
> > >
> > > How is it handled now?
> > > (it's just a question, not part of the review, as it's not something you
> > > were changing in your patch)
>
> You've seemed to miss this question.

Sorry. They are added via reorganize. I.e. manual ADD PARTITIONS does
data copy. Added test "Partition overflow error and manual fix" and
found pruning boundary bug in 10.3 (see FIXME).

>
> > > > +bool TABLE::vers_need_hist_part(const THD *thd, const TABLE_LIST *table_list) const
> > > > +{
> > > > +#ifdef WITH_PARTITION_STORAGE_ENGINE
> > > > +  if (part_info && part_info->part_type == VERSIONING_PARTITION &&
> > > > +      !table_list->vers_conditions.delete_history &&
> > > > +      !thd->stmt_arena->is_stmt_prepare() &&
> > > > +      table_list->lock_type >= TL_WRITE_ALLOW_WRITE &&
> > > > +       table_list->mdl_request.type >= MDL_SHARED_WRITE &&
> > > > +      table_list->mdl_request.type < MDL_EXCLUSIVE)
> > > > +  {
> > > > +    switch (thd->lex->sql_command)
> > > > +    {
> > >
> > > SQLCOM_LOAD with DUP_REPLACE?
> >
> > Added.
>
> With a test, I hope :)

Yes.

>
> > > > +    if (thd->rgi_slave && thd->rgi_slave->current_event && thd->lex->sql_command == SQLCOM_END)
> > >
> > > I wonder, would it be possible, instead of introducing rgi_slave->current_event,
> > > just make row events set thd->lex->sql_command appropriately?
> > > For example, currently row events increment thd->status_var.com_stat[]
> > > each event for its own SQLCOM_xxx, it won't be needed if they'll just set
> > > thd->lex->sql_command.
> >
> > Do you think this is a quick refactoring? I had the same idea when I
> > did this. And probably I tried to do that quickly already.
> > Added TODO comment.
>
> Right, may be not quick enough for now :(
>
> > > > @@ -2032,6 +2083,20 @@ bool open_table(THD *thd, TABLE_LIST *table_list, Open_table_context *ot_ctx)
> > > >      tc_add_table(thd, table);
> > > >    }
> > > >
> > > > +  if (!ot_ctx->vers_create_count &&
> > >
> > > what does this condition mean? When is ot_ctx->vers_create_count > 0 ?
> >
> > When we already requested backoff action.
>
> Could you describe an example?
>
> I thought if there are two tables in a list and both need to add a new
> partition (like in a multi-update), then backoff will happen on the
> first one, so open_table() won't reach the second with
> ot_ctx->vers_create_count > 0.

open_table() is called again from recover_from_failed_open().
vers_create_count is reset in recover_from_failed_open():

3306            case OT_ADD_HISTORY_PARTITION:
3307            {
3308              result= false;
3309              TABLE *table= open_ltable(m_thd, m_failed_table, TL_WRITE,
3310                        MYSQL_OPEN_HAS_MDL_LOCK |
MYSQL_OPEN_IGNORE_LOGGING_FORMAT);
....
3323              vers_create_count= 0;
3324              break;
3325            }
3326            case OT_BACKOFF_AND_RETRY:
3327            case OT_REOPEN_TABLES:
3328            case OT_NO_ACTION:
3329

In our scheme open_table() is called 3 times:

1. before partitions increment;
2. inside OT_ADD_HISTORY_PARTITION;
3. after partitions increment.

>
> > > > +      table->vers_need_hist_part(thd, table_list))
> > > > +  {
> > > > +    ot_ctx->vers_create_count= table->part_info->vers_set_hist_part(thd, true);
> > > > +    if (ot_ctx->vers_create_count)
> > > > +    {
> > > > +      MYSQL_UNBIND_TABLE(table->file);
> > > > +      tc_release_table(table);
> > > > +      ot_ctx->request_backoff_action(Open_table_context::OT_ADD_HISTORY_PARTITION,
> > > > +                                      table_list);
> > > > +      DBUG_RETURN(TRUE);
> > > > +    }
> > > > +  }
> > > > +
> > > >    if (!(flags & MYSQL_OPEN_HAS_MDL_LOCK) &&
> > > >        table->s->table_category < TABLE_CATEGORY_INFORMATION)
> > > >    {
> > > > @@ -3172,10 +3239,18 @@ Open_table_context::recover_from_failed_open()
> > > >        break;
> > > >      case OT_DISCOVER:
> > > >      case OT_REPAIR:
> > > > -      if ((result= lock_table_names(m_thd, m_thd->lex->create_info,
> > > > -                                    m_failed_table, NULL,
> > > > -                                    get_timeout(), 0)))
> > > > +    case OT_ADD_HISTORY_PARTITION:
> > > > +      result= lock_table_names(m_thd, m_thd->lex->create_info, m_failed_table,
> > > > +                               NULL, get_timeout(), 0);
> > > > +      if (result)
> > > > +      {
> > > > +        if (m_action == OT_ADD_HISTORY_PARTITION)
> > > > +        {
> > > > +          m_thd->clear_error();
> > >
> > > why?
> >
> > Because of MDEV-23642.
>
> I see. I generally prefer to avoid thd->clear_error() because of its non
> discriminative nature.
>
> It might be ok here, but then, please add an assert before
> lock_table_names:
>
>   DBUG_ASSERT(!m_thd->get_stmt_da()->is_set());
>

Added.

> > >
> > > > +          result= false;
> > > > +        }
> > > > @@ -7281,7 +7291,7 @@ uint fast_alter_partition_table(THD *thd, TABLE *table,
> > > >          write_log_add_change_partition(lpt) ||
> > > >          ERROR_INJECT_CRASH("crash_add_partition_4") ||
> > > >          ERROR_INJECT_ERROR("fail_add_partition_4") ||
> > > > -        mysql_change_partitions(lpt) ||
> > > > +        mysql_change_partitions(lpt, false) ||
> > >
> > > this way you skip trans_commit_stmt/trans_commit_implicit for
> > > ALTER TABLE ... ADD RANGE/LIST partition.
> > > is it always ok?
> > >
> > > A safer alternative would've been
> > >
> > >    mysql_change_partitions(lpt, !(alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST))
> > >
> > > but it's more complex, so I'd prefer your variant, if we can be sure that it
> > > doesn't break anything.
> >
> > The condition is
> >
> >  if (copy_data && mysql_trans_commit_alter_copy_data(thd))
> >
> > Data is not needed to be copied in ADD RANGE/LIST partition, is it? We
> > can be sure only from testing period until GA.
>
> well, we can never be sure after any amount of testing.
>
> You're right, the data, indeed, won't be copied in ADD RANGE/LIST
> partition. But perhaps InnoDB can still open a transaction there? As far
> as I understand a transaction is started on
> ha_innobase::external_lock(). Will it happen here? Or start_stmt() under
> LOCK TABLES. Is that possible?

mysql_execute_command() does trans_commit_stmt() and trans_commit_implicit().

>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx



--
All the best,

Aleksey Midenkov
@midenok


Follow ups

References