← Back to team overview

maria-developers team mailing list archive

Re: 9bf4b92cbc5: MDEV-17554 Auto-create new partition for system versioned tables with history partitioned by INTERVAL/LIMIT

 

Hi Sergei!


On Wed, Jul 1, 2020 at 12:26 AM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
> Hi, Aleksey!
>
> On Jun 29, Aleksey Midenkov wrote:
> > revision-id: 9bf4b92cbc5 (mariadb-10.5.2-168-g9bf4b92cbc5)
> > parent(s): 478301d9b9a
> > author: Aleksey Midenkov <midenok@xxxxxxxxx>
> > committer: Aleksey Midenkov <midenok@xxxxxxxxx>
> > timestamp: 2020-04-17 17:04:24 +0300
> > message:
> >
> > MDEV-17554 Auto-create new partition for system versioned tables with
history partitioned by INTERVAL/LIMIT
> >
> > == Syntax change ==
> >
> > Keyword AUTO_INCREMENT (or AUTO) enables partition auto-creation.
>
> I still think that AUTO_INCREMENT here looks rather out-of-place,
> I don't know how to explain to users why AUTO_INCREMENT is allowed here.
>
> But if we'll have it, we'll have to support it here for a long time
> for compatibility reasons. "Because somebody might be already using it"
>
> I'd suggest to support only AUTO here.

Sorry, that was just the wrong message. I already removed the
AUTO_INCREMENT keyword.

>
> Or, if the parser can do it easily, a much more readable syntax would be
>
>   create table t1 (x int) with system versioning
>   auto partition by system_time interval 1 hour;
>
> this would be the best, easy to read, matches the documentation (where
> it could be called "auto partitioning"). If it won't cause any difficult
> parser problems, I'd prefer us to use that.

Besides the problems of implementation this probably is not correct
logically. AUTO keyword is applicable only to SYSTEM_TIME partitioning so
it should go after SYSTEM_TIME keyword, but is also applicable only to
rotated partitions so it should go after rotation keywords as well.


>
> > create or replace table t1 (x int) with system versioning
> > partition by system_time interval 1 hour auto_increment;
> >
> > create or replace table t1 (x int) with system versioning
> > partition by system_time limit 1000 auto;
> >
> > Or with explicit partitions:
> >
> > create or replace table t1 (x int) with system versioning
> > partition by system_time interval 1 hour auto
> > (partition p0 history, partition pn current);
> >
> > == Description ==
> >
> > Before executing history-generating DML command add N history
> > partitions, so that N would be sufficient for potentially generated
> > history. N > 1 may be required when history is rotated by INTERVAL and
> > timestamp was jumped to future further than interval value.
> >
> > It is assumed that one DML command will not generate history rows more
> > than specified LIMIT. Otherwise history would overflow generated
> > partition, a warning would be printed, and user action would be
> > required to rebuild partitions.
> >
> > Auto-creation is implemented by synchronous
> > fast_alter_partition_table() call from the thread of the executed DML
> > command before the command itself (by the fallback and retry mechanism
> > similar to Discovery feature, see Open_table_context).
> >
> > Creating history partitions was made by the principle of minimal
> > disruption of the main business process. I.e. when procedure of
> > creation fails it will not (if possible) fail the original DML
> > command. Warning info will display what happened and the last history
> > partition may overflow. In such case partition rebuild is required to
> > correctly display history info. User application may detect warning
> > code ER_VERS_HIST_PART_FAILED and stop execution if that is preferred.
> >
> > The name for newly added partitions are generated like default
> > partition names with extension of MDEV-22155 (which avoids name
> > clashes by extending assignment counter to next free-enough gap).
> >
> > These DML commands trigger auto-creation:
> >
> > * DELETE (including multi-delete, excluding DELETE HISTORY)
> > * UPDATE (including multi-update)
> > * REPLACE (including REPLACE .. SELECT)
>
> INSERT ... ON DUPLICATE KEY UPDATE ?
>

Added.

> > diff --git a/mysql-test/suite/versioning/common.inc
b/mysql-test/suite/versioning/common.inc
> > index 355b571e5a0..b35a5138015 100644
> > --- a/mysql-test/suite/versioning/common.inc
> > +++ b/mysql-test/suite/versioning/common.inc
> > @@ -6,6 +6,7 @@ if (!$TEST_VERSIONING_SO)
> >  source include/have_innodb.inc;
> >
> >  set @@session.time_zone='+00:00';
> > +set @@global.time_zone='+00:00';
>
> Why is that? I'd expect your auto-adding of partitions to happen in the
> context of a session, using session time zone.

This is for the "Concurrent DML" test case. New connections do not get
current session time_zone.

>
> Is it for replication?
>
> >  select ifnull(max(transaction_id), 0) into @start_trx_id from
mysql.transaction_registry;
> >  set @test_start=now(6);
> >
> > diff --git a/mysql-test/suite/versioning/r/delete_history.result
b/mysql-test/suite/versioning/r/delete_history.result
> > index cb865a835b3..90c9e4777bb 100644
> > --- a/mysql-test/suite/versioning/r/delete_history.result
> > +++ b/mysql-test/suite/versioning/r/delete_history.result
> > @@ -63,8 +63,6 @@ insert into t values (1);
> >  update t set a= 2;
> >  update t set a= 3;
> >  delete history from t;
> > -Warnings:
> > -Warning      4114    Versioned table `test`.`t`: last HISTORY
partition (`p1`) is out of LIMIT, need more HISTORY partitions
>
> good riddance, it didn't make much sense anyway
>
> >  # The above warning is one command late (MDEV-20345) ^^^
> >  select * from t for system_time all;
> >  a
> > diff --git a/sql/ha_partition.h b/sql/ha_partition.h
> > index 85cb736b5bd..5fb893b4f0e 100644
> > --- a/sql/ha_partition.h
> > +++ b/sql/ha_partition.h
> > @@ -1623,7 +1623,7 @@ class ha_partition :public handler
>
> could you also, please, fix (in a separate commit)
> ha_partition::part_records() to take a proper
> partition_element* argument, not void* ?

Fixed.

>
> >      for (; part_id < part_id_end; ++part_id)
> >      {
> >        handler *file= m_file[part_id];
> > -      DBUG_ASSERT(bitmap_is_set(&(m_part_info->read_partitions),
part_id));
> > +      bitmap_set_bit(&(m_part_info->read_partitions), part_id);
> >        file->info(HA_STATUS_VARIABLE | HA_STATUS_NO_LOCK |
HA_STATUS_OPEN);
> >        part_recs+= file->stats.records;
> >      }
> > diff --git a/sql/partition_info.cc b/sql/partition_info.cc
> > index e9dbf2b49c3..72312ee7ac4 100644
> > --- a/sql/partition_info.cc
> > +++ b/sql/partition_info.cc
> > @@ -814,10 +814,16 @@ bool
partition_info::has_unique_name(partition_element *element)
> >      vers_info->interval   Limit by fixed time interval
> >      vers_info->hist_part  (out) Working history partition
> >  */
> > -void partition_info::vers_set_hist_part(THD *thd)
> > +uint partition_info::vers_set_hist_part(THD *thd, bool auto_inc)
>
> let's not all it auto_inc. it'll add noise to every grep for auto_inc
> issues. What about auto_add ?
>

Since auto-partitioning is already there (HA_USE_AUTO_PARTITION) auto_add
or auto_hist seem to be the good choice. I prefer auto_hist for more
information.

> >  {
> > +  DBUG_ASSERT(!thd->lex->last_table()->vers_conditions.delete_history);
> > +
> > +  uint create_count= 0;
> > +  auto_inc= auto_inc && vers_info->auto_inc;
> > +
> >    if (vers_info->limit)
> >    {
> > +    DBUG_ASSERT(!vers_info->interval.is_set());
> >      ha_partition *hp= (ha_partition*)(table->file);
> >      partition_element *next= NULL;
> >      List_iterator<partition_element> it(partitions);
> > @@ -836,22 +842,26 @@ void partition_info::vers_set_hist_part(THD *thd)
> >      {
> >        if (next == vers_info->now_part)
> >        {
> > -        my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG),
> > -                table->s->db.str, table->s->table_name.str,
> > -                vers_info->hist_part->partition_name, "LIMIT");
> > +        if (auto_inc)
> > +          create_count= 1;
> > +        else
> > +          my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG),
> > +                  table->s->db.str, table->s->table_name.str,
> > +                  vers_info->hist_part->partition_name, "LIMIT");
> >        }
> >        else
> >          vers_info->hist_part= next;
> >      }
> > -    return;
> > +    // reserve at least one history partition
> > +    if (auto_inc && create_count == 0 &&
> > +        vers_info->hist_part->id + 1 == vers_info->now_part->id)
> > +      create_count= 1;
>
> Questionable. What does it solve?

This is for the LOCK TABLES when the history partition is almost full:

-- source include/have_partition.inc

create or replace table t1 (x int)
with system versioning partition by system_time
limit 10 auto;

insert into t1 values (1), (2), (3), (4), (5), (6), (7), (8), (9);
update t1 set x= x + 10;

lock tables t1 write;
update t1 set x= 1 where x = 11;
update t1 set x= 2 where x = 12;
update t1 set x= 3 where x = 13;
unlock tables;

select count(x) from t1 partition (p0);
show create table t1;
drop tables t1;

Imagine 1000000 instead of 10. Locked commands operate on a small subset of
records. Nevertheless they are condemned to end in the wrong partition if
we didn't reserve one at the LOCK TABLES (we cannot add new partitions
while the table is locked). That of course works only if LOCK TABLES is
short enough to not fill the whole partition (which is expected to be a
normal scenario). We will not need this when we auto-add under LOCK TABLES.

>
> >    }
> > -
> > -  if (vers_info->interval.is_set())
> > +  else if (vers_info->interval.is_set() &&
> > +           vers_info->hist_part->range_value <= thd->query_start())
> >    {
> > -    if (vers_info->hist_part->range_value > thd->query_start())
> > -      return;
> > -
> >      partition_element *next= NULL;
> > +    bool error= true;
> >      List_iterator<partition_element> it(partitions);
> >      while (next != vers_info->hist_part)
> >        next= it++;
> > @@ -860,12 +870,200 @@ void partition_info::vers_set_hist_part(THD *thd)
> >      {
> >        vers_info->hist_part= next;
> >        if (next->range_value > thd->query_start())
> > -        return;
> > +      {
> > +        error= false;
> > +        break;
>
> but here you don't "reserve at least one history partition"
> Why INTERVAL is different?

Now as you asked I understand that INTERVAL is not different. Remade this
condition for both clauses and limited it to LOCK TABLES-only.

>
> > +      }
> > +    }
> > +    if (error)
> > +    {
> > +      if (auto_inc)
> > +      {
> > +        DBUG_ASSERT(thd->query_start() >=
vers_info->hist_part->range_value);
> > +        my_time_t diff= thd->query_start() -
vers_info->hist_part->range_value;
> > +        if (diff > 0)
> > +        {
> > +          size_t delta= vers_info->interval.seconds();
> > +          create_count= diff / delta + 1;
> > +          if (diff % delta)
> > +            create_count++;
> > +        }
> > +        else
> > +          create_count= 1;
> > +      }
> > +      else
> > +      {
> > +        my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG),
> > +                table->s->db.str, table->s->table_name.str,
> > +                vers_info->hist_part->partition_name, "INTERVAL");
> > +      }
> >      }
> > -    my_error(WARN_VERS_PART_FULL, MYF(ME_WARNING|ME_ERROR_LOG),
> > -            table->s->db.str, table->s->table_name.str,
> > -            vers_info->hist_part->partition_name, "INTERVAL");
> >    }
> > +
> > +  return create_count;
> > +}
> > +
> > +
> > +/**
> > +  @brief Run fast_alter_partition_table() to add new history partitions
> > +         for tables requiring them.
> > +*/
> > +void vers_add_auto_parts(THD *thd, TABLE_LIST* tl, uint num_parts)
> > +{
> > +  HA_CREATE_INFO create_info;
> > +  Alter_info alter_info;
> > +  String query;
> > +  partition_info *save_part_info= thd->work_part_info;
> > +  Query_tables_list save_query_tables;
> > +  Reprepare_observer *save_reprepare_observer=
thd->m_reprepare_observer;
> > +  Diagnostics_area new_stmt_da(thd->query_id, false, true);
> > +  Diagnostics_area *save_stmt_da= thd->get_stmt_da();
> > +  bool save_no_write_to_binlog= thd->lex->no_write_to_binlog;
> > +  const CSET_STRING save_query= thd->query_string;
> > +  thd->m_reprepare_observer= NULL;
> > +  thd->lex->reset_n_backup_query_tables_list(&save_query_tables);
> > +  thd->in_sub_stmt|= SUB_STMT_AUTO_HIST;
> > +  thd->lex->no_write_to_binlog=
!thd->is_current_stmt_binlog_format_row();
> > +  TABLE *table= tl->table;
> > +
> > +  DBUG_ASSERT(!thd->is_error());
> > +  /* NB: we have to preserve m_affected_rows, m_row_count_func,
m_last_insert_id, etc */
> > +  thd->set_stmt_da(&new_stmt_da);
> > +  new_stmt_da.set_overwrite_status(true);
>
> Is it needed?
> You've just started using a new Diagnostics_area, the status should be
> DA_EMPTY here, nothing to overwrite.

I rethought my idea of not failing the main command and removed
new_stmt_da. Better and simpler is to fail the command if alter fails.

>
> > +
> > +  {
> > +    DBUG_ASSERT(table->s->get_table_ref_type() ==
TABLE_REF_BASE_TABLE);
> > +    DBUG_ASSERT(table->versioned());
> > +    DBUG_ASSERT(table->part_info);
> > +    DBUG_ASSERT(table->part_info->vers_info);
> > +    alter_info.reset();
> > +    alter_info.partition_flags=
ALTER_PARTITION_ADD|ALTER_PARTITION_AUTO_HIST;
> > +    create_info.init();
> > +    create_info.alter_info= &alter_info;
> > +    Alter_table_ctx alter_ctx(thd, tl, 1, &table->s->db,
&table->s->table_name);
> > +
> > +    MDL_REQUEST_INIT(&tl->mdl_request, MDL_key::TABLE, tl->db.str,
> > +                    tl->table_name.str, MDL_SHARED_NO_WRITE,
MDL_TRANSACTION);
> > +    if (thd->mdl_context.acquire_lock(&tl->mdl_request,
> > +
 thd->variables.lock_wait_timeout))
> > +      goto exit;
> > +    table->mdl_ticket= tl->mdl_request.ticket;
> > +
> > +    create_info.db_type= table->s->db_type();
> > +    create_info.options|= HA_VERSIONED_TABLE;
> > +    DBUG_ASSERT(create_info.db_type);
> > +
> > +
 create_info.vers_info.set_start(table->s->vers_start_field()->field_name);
> > +
 create_info.vers_info.set_end(table->s->vers_end_field()->field_name);
> > +
> > +    partition_info *part_info= new partition_info();
> > +    if (unlikely(!part_info))
> > +    {
> > +      my_error(ER_OUT_OF_RESOURCES, MYF(0));
> > +      goto exit;
> > +    }
> > +    part_info->use_default_num_partitions= false;
> > +    part_info->use_default_num_subpartitions= false;
> > +    part_info->num_parts= num_parts;
> > +    part_info->num_subparts= table->part_info->num_subparts;
> > +    part_info->subpart_type= table->part_info->subpart_type;
> > +    if (unlikely(part_info->vers_init_info(thd)))
> > +    {
> > +      my_error(ER_OUT_OF_RESOURCES, MYF(0));
> > +      goto exit;
> > +    }
> > +
> > +    // NB: set_ok_status() requires DA_EMPTY
> > +    thd->get_stmt_da()->reset_diagnostics_area();
>
> Is it needed?
> You've just started using a new Diagnostics_area, the status should be
> DA_EMPTY here.

Reworked according to the above.

>
> > +
> > +    thd->work_part_info= part_info;
> > +    if (part_info->set_up_defaults_for_partitioning(thd, table->file,
NULL,
> > +
 table->part_info->next_part_no(num_parts)))
> > +    {
> > +      push_warning(thd, Sql_condition::WARN_LEVEL_WARN,
> > +                   ER_VERS_HIST_PART_FAILED,
> > +                   "Auto-increment history partition: "
> > +                   "setting up defaults failed");
> > +      my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING),
> > +               tl->db.str, tl->table_name.str);
> > +      goto exit;
> > +    }
> > +    bool partition_changed= false;
> > +    bool fast_alter_partition= false;
> > +    if (prep_alter_part_table(thd, table, &alter_info, &create_info,
> > +                              &partition_changed,
&fast_alter_partition))
> > +    {
> > +      push_warning(thd, Sql_condition::WARN_LEVEL_WARN,
ER_VERS_HIST_PART_FAILED,
> > +                   "Auto-increment history partition: "
> > +                   "alter partitition prepare failed");
> > +      my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING),
> > +               tl->db.str, tl->table_name.str);
> > +      goto exit;
> > +    }
> > +    if (!fast_alter_partition)
> > +    {
> > +      push_warning(thd, Sql_condition::WARN_LEVEL_WARN,
ER_VERS_HIST_PART_FAILED,
> > +                   "Auto-increment history partition: "
> > +                   "fast alter partitition is not possible");
> > +      my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING),
> > +               tl->db.str, tl->table_name.str);
> > +      goto exit;
> > +    }
> > +    DBUG_ASSERT(partition_changed);
> > +    if (mysql_prepare_alter_table(thd, table, &create_info,
&alter_info,
> > +                                  &alter_ctx))
> > +    {
> > +      push_warning(thd, Sql_condition::WARN_LEVEL_WARN,
ER_VERS_HIST_PART_FAILED,
> > +                   "Auto-increment history partition: "
> > +                   "alter prepare failed");
> > +      my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING),
> > +               tl->db.str, tl->table_name.str);
> > +      goto exit;
> > +    }
> > +
> > +    // Forge query string for rpl logging
> > +    if (!thd->lex->no_write_to_binlog)
> > +    {
> > +      query.set(STRING_WITH_LEN("ALTER TABLE `"), &my_charset_latin1);
>
> better use StringBuffer<...> query;
> because you're doing lots of reallocs now.
>
> > +
> > +      if (query.append(table->s->db) ||
> > +          query.append(STRING_WITH_LEN("`.`")) ||
> > +          query.append(table->s->table_name) ||
>
> this is wrong, the table and db names can have backticks inside.
> use append_identifier() instead.
>
> > +          query.append("` ADD PARTITION PARTITIONS ") ||
> > +          query.append_ulonglong(part_info->num_parts) ||
> > +          query.append(" AUTO"))
> > +      {
> > +        my_error(ER_OUT_OF_RESOURCES, MYF(ME_ERROR_LOG));
> > +        goto exit;
> > +      }
> > +      CSET_STRING qs(query.c_ptr(), query.length(),
&my_charset_latin1);
> > +      thd->set_query(qs);
> > +    }
>
> Should it be binlogged at all? May be just leave it to the slave to
> auto-add the partition as needed?

This is for RBR where auto-add didn't work. Reworked via solution of fixing
the slave side.

>
> Looks a bit suspicious now, you log an ALTER TABLE possibly in the
> middle of a transaction. It will be replayed differently, not as
> auto-adding, in particular, it will commit. So, possibly different
> results on the slave, perhaps different gtid.
>
> Do clear it out with Andrei please. Or don't binlog auto-adding at all.
>
> > +
> > +    if (fast_alter_partition_table(thd, table, &alter_info,
&create_info,
> > +                                   tl, &table->s->db,
&table->s->table_name))
> > +    {
> > +      push_warning(thd, Sql_condition::WARN_LEVEL_WARN,
ER_VERS_HIST_PART_FAILED,
> > +                   "Auto-increment history partition: "
> > +                   "alter partition table failed");
> > +      my_error(ER_VERS_HIST_PART_FAILED, MYF(ME_WARNING),
> > +               tl->db.str, tl->table_name.str);
> > +    }
> > +  }
> > +
> > +  if (!thd->transaction.stmt.is_empty())
> > +    trans_commit_stmt(thd);
>
> 1. why would this operation register a transaction?

Transaction is registered here:

#0  0x0000000000c75288 in Ha_trx_info::register_ha
(this=0x7ff8040034b8, trans=0x7ff804004320, ht_arg=0x2d99778) at
/home/midenok/src/mariadb/10.5/src/sql/handler.h:1918
#1  0x0000000000c5975a in trans_register_ha (thd=0x7ff804000d48,
all=false, ht_arg=0x2d99778, trxid=422179462451680) at
/home/midenok/src/mariadb/10.5/src/sql/handler.cc:1322
#2  0x0000000001111d44 in innobase_register_trx (hton=0x2d99778,
thd=0x7ff804000d48, trx=0x7ff850e411e0) at
/home/midenok/src/mariadb/10.5/src/storage/innobase/handler/ha_innodb.cc:2608
#3  0x000000000112f9fd in ha_innobase::external_lock
(this=0x7ff804b243c0, thd=0x7ff804000d48, lock_type=1) at
/home/midenok/src/mariadb/10.5/src/storage/innobase/handler/ha_innodb.cc:15524
#4  0x0000000000c5f38e in handler::ha_external_lock
(this=0x7ff804b243c0, thd=0x7ff804000d48, lock_type=1) at
/home/midenok/src/mariadb/10.5/src/sql/handler.cc:6660
#5  0x000000000104f19f in ha_partition::external_lock
(this=0x7ff804ada670, thd=0x7ff804000d48, lock_type=1) at
/home/midenok/src/mariadb/10.5/src/sql/ha_partition.cc:4026
#6  0x0000000000c5f32c in handler::ha_external_lock
(this=0x7ff804ada670, thd=0x7ff804000d48, lock_type=1) at
/home/midenok/src/mariadb/10.5/src/sql/handler.cc:6660
#7  0x0000000000de5b91 in lock_external (thd=0x7ff804000d48,
tables=0x7ff8040160a0, count=1) at
/home/midenok/src/mariadb/10.5/src/sql/lock.cc:393
#8  0x0000000000de56c3 in mysql_lock_tables (thd=0x7ff804000d48,
sql_lock=0x7ff804016080, flags=131104) at
/home/midenok/src/mariadb/10.5/src/sql/lock.cc:338
#9  0x0000000000de47b2 in mysql_lock_tables (thd=0x7ff804000d48,
tables=0x7ff8040158e8, count=1, flags=131104) at
/home/midenok/src/mariadb/10.5/src/sql/lock.cc:301
#10 0x00000000007d487e in open_ltable (thd=0x7ff804000d48,
table_list=0x7ff8040157f0, lock_type=TL_WRITE, lock_flags=131104) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:5187
#11 0x00000000007d3349 in Open_table_context::recover_from_failed_open
(this=0x7ff84805a6e8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:3258
#12 0x00000000007d5623 in open_tables (thd=0x7ff804000d48,
options=..., start=0x7ff84805b2a8, counter=0x7ff84805b25c, flags=0,
prelocking_strategy=0x7ff84805a7a8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:4367
#13 0x00000000007c7ac6 in open_tables (thd=0x7ff804000d48,
tables=0x7ff84805b2a8, counter=0x7ff84805b25c, flags=0) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.h:477
#14 0x00000000009d9c13 in mysql_update (thd=0x7ff804000d48,
table_list=0x7ff804014b68, fields=..., values=..., conds=0x0,
order_num=0, order=0x0, limit=18446744073709551615, ignore=false,
found_return=0x7ff84805c178, updated_return=0x7ff84805c170) at
/home/midenok/src/mariadb/10.5/src/sql/sql_update.cc:401
#15 0x000000000089bc0a in mysql_execute_command (thd=0x7ff804000d48)
at /home/midenok/src/mariadb/10.5/src/sql/sql_parse.cc:4247
#16 0x0000000000892f56 in mysql_parse (thd=0x7ff804000d48,
rawbuf=0x7ff804014a90 "update t1 set x= x + 1", length=22,
parser_state=0x7ff84805d598) at
/home/midenok/src/mariadb/10.5/src/sql/sql_parse.cc:7837


> 2. you amended a couple of checks like
>
> -  if (thd->in_sub_stmt)
> +  if (thd->in_sub_stmt & ~SUB_STMT_AUTO_HIST)
>
> but what will happen if it's, for example, an insert inside a trigger?

This is LTM_PRELOCKED mode where we can't handle recover_from_failed_open()
yet (MDEV-23639).

> So a real sub_stmt and SUB_STMT_AUTO_HIST?

This condition will pass.

>
> > +
> > +exit:
> > +  thd->work_part_info= save_part_info;
> > +  thd->m_reprepare_observer= save_reprepare_observer;
> > +  thd->lex->restore_backup_query_tables_list(&save_query_tables);
> > +  thd->in_sub_stmt&= ~SUB_STMT_AUTO_HIST;
> > +  if (!new_stmt_da.is_warning_info_empty())
> > +    save_stmt_da->copy_sql_conditions_from_wi(thd,
new_stmt_da.get_warning_info());
> > +  thd->set_stmt_da(save_stmt_da);
> > +  thd->lex->no_write_to_binlog= save_no_write_to_binlog;
> > +  thd->set_query(save_query);
> >  }
> >
> >
> > diff --git a/sql/partition_info.h b/sql/partition_info.h
> > index 7ae2d168068..ca68e61cbe2 100644
> > --- a/sql/partition_info.h
> > +++ b/sql/partition_info.h
> > @@ -72,9 +73,34 @@ struct Vers_part_info : public Sql_alloc
> >      my_time_t start;
> >      INTERVAL step;
> >      enum interval_type type;
> > -    bool is_set() { return type < INTERVAL_LAST; }
> > +    bool is_set() const { return type < INTERVAL_LAST; }
> > +    size_t seconds() const
> > +    {
> > +      if (step.second)
> > +        return step.second;
> > +      if (step.minute)
> > +        return step.minute * 60;
> > +      if (step.hour)
> > +        return step.hour * 3600;
> > +      if (step.day)
> > +        return step.day * 3600 * 24;
> > +      // comparison is used in rough estimates, it doesn't need to be
calendar-correct
>
> Are you sure the approximate value is enough here? You use it to
> estimate how many partitions to create. It's not a big deal if you
> create more than necessary (although it's not nice and we'll definitely
> get bug reports if that will happen). But you surely don't want to
> create less.

True and that's why I made 30 days a month and 365 days a year: lower value
makes more partitions. Probably a student task to make calendar-correct
calculations or utilize libc functions.

>
> > +      if (step.month)
> > +        return step.month * 3600 * 24 * 30;
> > +      DBUG_ASSERT(step.year);
> > +      return step.year * 86400 * 30 * 365;
> > +    }
> > +    bool lt(size_t secs) const
> > +    {
> > +      return seconds() < secs;
> > +    }
> > +    bool ge(size_t seconds) const
> > +    {
> > +      return !(this->lt(seconds));
> > +    }
>
> lt() and ge() don't seem to be used anywhere.

Removed.

>
> >    } interval;
> >    ulonglong limit;
> > +  bool auto_inc;
> >    partition_element *now_part;
> >    partition_element *hist_part;
> >  };
> > diff --git a/sql/sql_base.cc b/sql/sql_base.cc
> > index b89be77f282..5583d70f8eb 100644
> > --- a/sql/sql_base.cc
> > +++ b/sql/sql_base.cc
> > @@ -1862,6 +1862,25 @@ bool open_table(THD *thd, TABLE_LIST
*table_list, Open_table_context *ot_ctx)
> >        DBUG_PRINT("info",("Using locked table"));
> >  #ifdef WITH_PARTITION_STORAGE_ENGINE
> >        part_names_error= set_partitions_as_used(table_list, table);
> > +      if (table->part_info &&
> > +          table->part_info->part_type == VERSIONING_PARTITION &&
> > +          !table_list->vers_conditions.delete_history &&
> > +          table_list->lock_type >= TL_WRITE_ALLOW_WRITE &&
> > +          table_list->mdl_request.type == MDL_SHARED_WRITE)
> > +      {
> > +        switch (thd->lex->sql_command)
> > +        {
> > +        case SQLCOM_DELETE:
> > +        case SQLCOM_UPDATE:
> > +        case SQLCOM_REPLACE:
> > +        case SQLCOM_REPLACE_SELECT:
> > +        case SQLCOM_DELETE_MULTI:
> > +        case SQLCOM_UPDATE_MULTI:
> > +          /* Rotation is still needed under LOCK TABLES */
> > +          table->part_info->vers_set_hist_part(thd, false);
>
> ALTER TABLE works under LOCK TABLES, so this should be possible too,
> but let's have it as a separate task.

Ok, I'd wish it was doable. MDEV-23639 is the task.

>
> > +        default:;
> > +        }
> > +      }
> >  #endif
> >        goto reset;
> >      }
> > @@ -2104,6 +2123,37 @@ bool open_table(THD *thd, TABLE_LIST
*table_list, Open_table_context *ot_ctx)
> >      tc_add_table(thd, table);
> >    }
> >
> > +#ifdef WITH_PARTITION_STORAGE_ENGINE
> > +  if (table->part_info &&
> > +      table->part_info->part_type == VERSIONING_PARTITION &&
> > +      !table_list->vers_conditions.delete_history &&
> > +      table_list->lock_type >= TL_WRITE_ALLOW_WRITE &&
> > +      table_list->mdl_request.type >= MDL_SHARED_WRITE &&
> > +      table_list->mdl_request.type < MDL_EXCLUSIVE)
> > +  {
> > +    switch (thd->lex->sql_command)
> > +    {
> > +    case SQLCOM_LOCK_TABLES:
> > +    case SQLCOM_DELETE:
> > +    case SQLCOM_UPDATE:
> > +    case SQLCOM_REPLACE:
> > +    case SQLCOM_REPLACE_SELECT:
> > +    case SQLCOM_DELETE_MULTI:
> > +    case SQLCOM_UPDATE_MULTI:
>
> this is quite complex condition, better not to duplicate it. May be
> something like
>
>    bool need_set_hist_part(TABLE_LIST *table_list, enum_sql_command
sql_command)
>    {
>      if (table_list->table->part_info &&
>          table_list->table->part_info->part_type == VERSIONING_PARTITION
&&
>          !table_list->vers_conditions.delete_history &&
>          table_list->lock_type >= TL_WRITE_ALLOW_WRITE &&
>          table_list->mdl_request.type >= MDL_SHARED_WRITE &&
>          table_list->mdl_request.type < MDL_EXCLUSIVE)
>      {
>        switch(sql_command) {
>        case SQLCOM_LOCK_TABLES:
>        case SQLCOM_DELETE:
>        case SQLCOM_UPDATE:
>        case SQLCOM_REPLACE:
>        case SQLCOM_REPLACE_SELECT:
>        case SQLCOM_DELETE_MULTI:
>        case SQLCOM_UPDATE_MULTI:
>          return true;
>        }
>        default:;
>      }
>      return false;
>    }
>

Fixed.

> > +      ot_ctx->vers_create_count=
table->part_info->vers_set_hist_part(thd, true);
> > +      if (ot_ctx->vers_create_count)
> > +      {
> > +
 ot_ctx->request_backoff_action(Open_table_context::OT_ADD_HISTORY_PARTITION,
> > +                                       table_list);
> > +        MYSQL_UNBIND_TABLE(table->file);
> > +        tc_release_table(table);
>
> There's no MYSQL_UNBIND_TABLE/tc_release_table after other
> request_backoff_action invocations. Why this one is special?
>

It might be different by the TABLE object state. There is no dependence
for request_backoff_action() rather it is needed for
further tdc_remove_table() and handler::rebind_psi().

Info: hang in wait_for_refs()

If we don't do tc_release_table() on request_backoff_action().

#0  futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x7fffc4b0f638) at
../sysdeps/unix/sysv/linux/futex-internal.h:80
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x7fffc4b0f588, cond=0x7fffc4b0f610) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7fffc4b0f610, mutex=0x7fffc4b0f588) at
pthread_cond_wait.c:638
#3  0x00000000016ac979 in safe_cond_wait (cond=0x7fffc4b0f610,
mp=0x7fffc4b0f560, file=0x175364a
"/home/midenok/src/mariadb/10.5/src/include/mysql/psi/mysql_thread.h",
line=1220) at /home/midenok/src/mariadb/10.5/src/mysys/thr_mutex.c:492
#4  0x0000000000b62814 in inline_mysql_cond_wait (that=0x7fffc4b0f610,
mutex=0x7fffc4b0f560, src_file=0x17cc940
"/home/midenok/src/mariadb/10.5/src/sql/table_cache.cc",
src_line=1236) at
/home/midenok/src/mariadb/10.5/src/include/mysql/psi/mysql_thread.h:1220
#5  0x0000000000b67b56 in TDC_element::wait_for_refs
(this=0x7fffc4b0f3c8, my_refs=1) at
/home/midenok/src/mariadb/10.5/src/sql/table_cache.cc:1236
#6  0x0000000000b65a50 in tdc_remove_referenced_share
(thd=0x7fffc4000d48, share=0x7fffc4ac5550) at
/home/midenok/src/mariadb/10.5/src/sql/table_cache.cc:1006
#7  0x0000000000b65f8e in tdc_remove_table (thd=0x7fffc4000d48,
db=0x7fffc4015238 "test", table_name=0x7fffc4014b30 "t1") at
/home/midenok/src/mariadb/10.5/src/sql/table_cache.cc:1061
#8  0x00000000007d30b1 in Open_table_context::recover_from_failed_open
(this=0x7ffff06666e8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:3235
#9  0x00000000007d55d3 in open_tables (thd=0x7fffc4000d48,
options=..., start=0x7ffff06672a8, counter=0x7ffff066725c, flags=0,
prelocking_strategy=0x7ffff06667a8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:4378
#10 0x00000000007c7a36 in open_tables (thd=0x7fffc4000d48,
tables=0x7ffff06672a8, counter=0x7ffff066725c, flags=0) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.h:477
#11 0x00000000009d9bc3 in mysql_update (thd=0x7fffc4000d48,
table_list=0x7fffc4014b68, fields=..., values=..., conds=0x0,
order_num=0, order=0x0, limit=18446744073709551615, ignore=false,
found_return=0x7ffff0668178, updated_return=0x7ffff0668170) at
/home/midenok/src/mariadb/10.5/src/sql/sql_update.cc:401

Without MYSQL_UNBIND_TABLE():

#7  0x00007fa00aa23006 in __GI___assert_fail (assertion=0x196de1f
"pfs->m_thread_owner == __null", file=0x196d990
"/home/midenok/src/mariadb/10.5/src/storage/perfschema/pfs.cc",
line=2007, function=0x196de3d "PSI_table
*pfs_rebind_table_v1(PSI_table_share *, const void *, PSI_table *)")
at assert.c:101
#8  0x000000000106ec18 in pfs_rebind_table_v1 (share=0x343d540,
identity=0x7f9fc8c62e70, table=0x7f9fc83a7700) at
/home/midenok/src/mariadb/10.5/src/storage/perfschema/pfs.cc:2007
#9  0x0000000000c5f3ce in handler::rebind_psi (this=0x7f9fc8c62e70) at
/home/midenok/src/mariadb/10.5/src/sql/handler.cc:2918
#10 0x00000000007d0d95 in open_table (thd=0x7f9fb8000d48,
table_list=0x7f9fb8011508, ot_ctx=0x7f9fffd656e8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:2019
#11 0x00000000007d6be7 in open_and_process_table (thd=0x7f9fb8000d48,
tables=0x7f9fb8011508, counter=0x7f9fffd6625c, flags=0,
prelocking_strategy=0x7f9fffd657a8, has_prelocking_list=false,
ot_ctx=0x7f9fffd656e8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:3876
#12 0x00000000007d5530 in open_tables (thd=0x7f9fb8000d48,
options=..., start=0x7f9fffd662a8, counter=0x7f9fffd6625c, flags=0,
prelocking_strategy=0x7f9fffd657a8) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.cc:4348
#13 0x00000000007c7a36 in open_tables (thd=0x7f9fb8000d48,
tables=0x7f9fffd662a8, counter=0x7f9fffd6625c, flags=0) at
/home/midenok/src/mariadb/10.5/src/sql/sql_base.h:477
#14 0x00000000009d9bd3 in mysql_update (thd=0x7f9fb8000d48,
table_list=0x7f9fb8011508, fields=..., values=...,
conds=0x7f9fb80121f8, order_num=0, order=0x0,
limit=18446744073709551615, ignore=false, found_return=0x7f9fffd67178,
updated_return=0x7f9fffd67170) at
/home/midenok/src/mariadb/10.5/src/sql/sql_update.cc:401
#15 0x000000000089bbca in mysql_execute_command (thd=0x7f9fb8000d48)
at /home/midenok/src/mariadb/10.5/src/sql/sql_parse.cc:4247
#16 0x0000000000892f16 in mysql_parse (thd=0x7f9fb8000d48,
rawbuf=0x7f9fb8011410 "update t1 set x= x + 122 where x = 1",
length=36, parser_state=0x7f9fffd68598) at
/home/midenok/src/mariadb/10.5/src/sql/sql_parse.cc:7837


There is also tc_release_table():

    if (table->file->discover_check_version())
    {
      tc_release_table(table);
      (void) ot_ctx->request_backoff_action(Open_table_context::OT_DISCOVER,
                                            table_list);
      DBUG_RETURN(TRUE);
    }

> > +        DBUG_RETURN(TRUE);
> > +      }
> > +    default:;
> > +    }
> > +  }
> > +#endif
> > +
> >    if (!(flags & MYSQL_OPEN_HAS_MDL_LOCK) &&
> >        table->s->table_category < TABLE_CATEGORY_INFORMATION)
> >    {
> > diff --git a/sql/sql_class.h b/sql/sql_class.h
> > index 14516811262..145ac5c5f64 100644
> > --- a/sql/sql_class.h
> > +++ b/sql/sql_class.h
> > @@ -3346,6 +3347,10 @@ class THD: public THD_count, /* this must be
first */
> >
> >  #ifdef WITH_PARTITION_STORAGE_ENGINE
> >    partition_info *work_part_info;
> > +  /**
> > +    List of tables requiring new history partition.
> > +  */
> > +  List<TABLE_SHARE> vers_auto_part_tables;
>
> this doesn't seem to be used anywhere
>

Removed.

> >  #endif
> >
> >  #ifndef EMBEDDED_LIBRARY
> > diff --git a/sql/sql_yacc.yy b/sql/sql_yacc.yy
> > index 7cc1faea79b..3f6c1793432 100644
> > --- a/sql/sql_yacc.yy
> > +++ b/sql/sql_yacc.yy
> > @@ -7521,14 +7529,17 @@ add_partition_rule:
> >
> >  add_part_extra:
> >            /* empty */
> > -        | '(' part_def_list ')'
> > +        | '(' part_def_list ')' opt_vers_auto_inc
> >            {
> > -            LEX *lex= Lex;
> > -            lex->part_info->num_parts=
lex->part_info->partitions.elements;
> > +            Lex->part_info->num_parts=
Lex->part_info->partitions.elements;
> > +            if ($4)
> > +              Lex->alter_info.partition_flags|=
ALTER_PARTITION_AUTO_HIST;
> >            }
> > -        | PARTITIONS_SYM real_ulong_num
> > +        | PARTITIONS_SYM real_ulong_num opt_vers_auto_inc
> >            {
> >              Lex->part_info->num_parts= $2;
> > +            if ($3)
> > +              Lex->alter_info.partition_flags|=
ALTER_PARTITION_AUTO_HIST;
>
> I'm confused. I thought that ALTER_PARTITION_AUTO_HIST is what you set
> in vers_add_auto_parts() to mark auto-adding of a new partition.
>
> But here you set it in the parser, when a user specifies AUTO in
> partition specifications. So it seems you're using this flag for two
> very different purposes. How do you distinguish between them?
> And why did you decide to do it this way?

This is auto-add via ALTER command for RBR replication. Removed as part of
RBR rework.

>
> >            }
> >          ;
> >
> > diff --git a/sql/sql_partition.cc b/sql/sql_partition.cc
> > index fbed614489d..25017ee5425 100644
> > --- a/sql/sql_partition.cc
> > +++ b/sql/sql_partition.cc
> > @@ -4825,6 +4829,13 @@ uint prep_alter_part_table(THD *thd, TABLE
*table, Alter_info *alter_info,
> >      DBUG_RETURN(TRUE);
> >    }
> >
> > +  if (alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST &&
> > +      (!table->part_info || !table->part_info->vers_info))
> > +  {
> > +    my_error(ER_SYNTAX_ERROR, MYF(0));
>
> can it even happen? Or should it be DBUG_ASSERT(0) here?

This was for the above sql_yacc changes accordingly. Removed this as well.

>
> > +    DBUG_RETURN(TRUE);
> > +  }
> > +
> >    partition_info *alt_part_info= thd->lex->part_info;
> >    /*
> >      This variable is TRUE in very special case when we add only DEFAULT
> > @@ -5312,7 +5323,9 @@ that are reorganised.
> >                now_part= el;
> >              }
> >            }
> > -          if (*fast_alter_table &&
tab_part_info->vers_info->interval.is_set())
> > +          if (*fast_alter_table &&
> > +              !(alter_info->partition_flags &
ALTER_PARTITION_AUTO_HIST) &&
> > +              tab_part_info->vers_info->interval.is_set())
>
> this !ALTER_PARTITION_AUTO_HIST - do you mean not vers_add_auto_parts() or
> not sql statement that uses AUTO ?

This is for not auto-adding so it is for both. Removed the second so it is
for the first.

>
> >            {
> >              partition_element *hist_part=
tab_part_info->vers_info->hist_part;
> >              if (hist_part->range_value <= thd->query_start())
> > @@ -5347,7 +5360,8 @@ that are reorganised.
> >        */
> >        if (!(alter_info->partition_flags & ALTER_PARTITION_TABLE_REORG))
> >        {
> > -        if (!alt_part_info->use_default_partitions)
> > +        if (!alt_part_info->use_default_partitions &&
> > +            !(alter_info->partition_flags & ALTER_PARTITION_AUTO_HIST))
>
> Sorry, I don't understand that if() at all. What does it do?

This was for the second only (sql statement) so removed this as well.

>
> >          {
> >            DBUG_PRINT("info", ("part_info: %p", tab_part_info));
> >            tab_part_info->use_default_partitions= FALSE;
>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx



--
All the best,

Aleksey Midenkov
@midenok

Follow ups

References