← Back to team overview

maria-developers team mailing list archive

Re: 8742d176bc2: Added support for more functions when using partitioned S3 tables

 

Hi!

On Fri, Apr 17, 2020 at 3:49 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:

<cut>

> > +++ b/mysql-test/suite/s3/replication_partition.test
> > @@ -0,0 +1,170 @@
> > +--source include/have_s3.inc
> > +--source include/have_partition.inc
> > +--source include/master-slave.inc
>
> master-slave should always be included last, after all other have_xxx includes.

<cut>

> > diff --git a/mysys/my_symlink.c b/mysys/my_symlink.c
> > index cbee78a7f5c..323ae69a39c 100644
> > --- a/mysys/my_symlink.c
> > +++ b/mysys/my_symlink.c
> > @@ -154,7 +154,8 @@ int my_realpath(char *to, const char *filename, myf MyFlags)
> >        original name but will at least be able to resolve paths that starts
> >        with '.'.
> >      */
> > -    DBUG_PRINT("error",("realpath failed with errno: %d", errno));
> > +    if (MyFlags)
> > +      DBUG_PRINT("error",("realpath failed with errno: %d", errno));
>
> This is kind of against the concept of dbug. dbug's idea is that one should
> not edit the code every time, but put DBUG points one and enable/disable them
> at run-time.
>
> In cases like that you can specify precisely what you want to see in the trace
> with complicated command-line --debug string. But I usually just let it
> log everything and then use dbug/remove_function_from_trace.pl script.

The reason for the above is that it was not a real error, but a check if the
file was a path.

The reason for disabling this is that when I have a problem, I usually search in
the trrace for "error:".  With big traces, there is a lot of calls to
realpath and I
have to skip all these which is very annoying and not helpful as the above
case is not an error!

In other words, we should only log things with label "error" when it's
really is an error
Changing the label depending on MyFlags is not helpful as the rest of
the dbug trace
will anyway show the result from realpath()


> > diff --git a/sql/discover.cc b/sql/discover.cc
> > index e49a2a3b0c0..7f3fe73c155 100644
> > --- a/sql/discover.cc
> > +++ b/sql/discover.cc
> > @@ -99,23 +99,23 @@ int readfrm(const char *name, const uchar **frmdata, size_t *len)
> >
> >
> >  /*
> > -  Write the content of a frm data pointer
> > -  to a frm file.
> > +  Write the content of a frm data pointer to a frm or par file.
>
> really? writefrm() writes to a .par file?
Yes, that was confusing and the reason I changed the comment.

> may be rename it to writefile() then?

Yes, I will do a rename of this function.

> > diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
> > index 0ec1f2138ab..b8b5085f389 100644
> > --- a/sql/ha_partition.cc
> > +++ b/sql/ha_partition.cc
> > @@ -629,7 +629,8 @@ int ha_partition::rename_table(const char *from, const char *to)
> >
> >    SYNOPSIS
> >      create_partitioning_metadata()
> > -    name                              Full path of table name
> > +    path                              Full path of new table name
> > +    old_p                             Full path of old table name
>
> This is a rather meaningless phrase, there cannot be a "path" of a "name"
>
> May be, write "A path to the new/old frm file but without the extension"
> instead?

Fixed

> >      create_info                       Create info generated for CREATE TABLE
> >
> >    RETURN VALUE
> > @@ -678,6 +681,19 @@ int ha_partition::create_partitioning_metadata(const char *path,
> >        DBUG_RETURN(1);
> >      }
> >    }
> > +
> > +  /* m_part_info is only NULL when we failed to create a partition table */
>
> How can m_part_info be NULL here?
> create_handler_file() dereferences m_part_info unconditionally, if
> m_part_info would've been NULL it'll crash.

Yes, but create_handler_part() is only used for create, not f or drop
or rename and in these
cases m_part_info can be 0.

I added this, as I got this error while testing S3. One of the test in
S3 covers this code.
If you comment the 'if', you will get a core dump (just tested that)
in s3.replication_partition

> > @@ -1604,6 +1620,7 @@ int ha_partition::prepare_new_partition(TABLE *tbl,
> >
> >    if (!(file->ht->flags & HTON_CAN_READ_CONNECT_STRING_IN_PARTITION))
> >      tbl->s->connect_string= p_elem->connect_string;
> > +  create_info->options|= HA_CREATE_TMP_ALTER;
>
> This looks wrong. The partition isn't created as a temporary file.
> HA_CREATE_TMP_ALTER is used for autiding and perfschema to distinguish
> between normal tables, temporary tables and these alter-tmp-tables.

> I see that you've hijacked this flag and check it inside s3, but it wasn't
> designed for that, so change s3 instead, not what the flag means.
> I don't think you need that if inside s3 at all, if hton flag says
> HTON_TEMPORARY_NOT_SUPPORTED, then s3 won't be asked to create a temporary
> table and you don't need a check for that in create.

It's acutally the other way around.  S3 only allows creation of
temporary tables, so I need
a flag for this.
In addition, prepare_new_partition() creates temporary tables, like:
t1#P#p0#TMP# , which will later be renamed to their proper name,
so they should be marked with HA_CREATE_TMP_ALTER.

> >    if ((error= file->ha_create(part_name, tbl, create_info)))
> >    {
> >      /*
> > @@ -2361,14 +2379,20 @@ uint ha_partition::del_ren_table(const char *from, const char *to)
> >    const char *to_path= NULL;
> >    uint i;
> >    handler **file, **abort_file;
> > +  THD *thd= ha_thd();
> >    DBUG_ENTER("ha_partition::del_ren_table");
> >
> > -  if (get_from_handler_file(from, ha_thd()->mem_root, false))
> > -    DBUG_RETURN(TRUE);
> > +  if (get_from_handler_file(from, thd->mem_root, false))
> > +    DBUG_RETURN(my_errno ? my_errno : ENOENT);
> >    DBUG_ASSERT(m_file_buffer);
> >    DBUG_PRINT("enter", ("from: (%s) to: (%s)", from, to ? to : "(nil)"));
> >    name_buffer_ptr= m_name_buffer_ptr;
> > +
> >    file= m_file;
> > +  /* The command should be logged with IF EXISTS if using a shared table */
> > +  if (m_file[0]->ht->flags & HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE)
> > +    thd->replication_flags|= OPTION_IF_EXISTS;
>
> I don't think this and the whole thd->replication_flags is needed,
> because in the sql_table.cc you can check directly
>  table->file->partition_ht()->flags && HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE

If you remove the above, the s3 tests will break.
I don't see where I can check this in sql_table.cc as the table is not opened
and sql_table.cc can't access table->file.

<cut>

> > +  /* Update .par file in the handlers that supports it */
> > +  if ((*m_file)->ht->create_partitioning_metadata)
> > +  {
> > +    if (to == NULL)
> > +      error= (*m_file)->ht->create_partitioning_metadata(NULL, from,
> > +                                                         CHF_DELETE_FLAG);
> > +    else
> > +      error= (*m_file)->ht->create_partitioning_metadata(to, from,
> > +                                                         CHF_RENAME_FLAG);
>
> why not just
>
>   error= (*m_file)->ht->create_partitioning_metadata(to, from);
>
> the engine can distingush CREATE from RENAME from DELETE by looking
> at whether 'to' or `from` is NULL.

There is a also a CHF_CREATE flag, so the function can't change.

However, it can be changed to:

      error= (*m_file)->ht->create_partitioning_metadata(to, from,
                                                        to == NULL ?
CHF_DELETE_FLAG : CHF_RENAME_FLAG);
Now done
<cut>

> > +/*
> > +  Check if the table definition has changed for the part tables
> > +  We use the first partition for the check.
> > +*/
> > +
> > +int ha_partition::discover_check_version()
> > +{
> > +  return m_file[0]->discover_check_version();
> > +}
>
> 1. generally, you have to do it for every open partition, not just for the
> first one.

As  we only support one engine for partitioning the above is safe.
In any case, even if we would support multiple engines, when it come to
a distributed partition file, then in this case all tables MUST be from
the same engine as otherwise things would break for any other server
using the table.

> 2. m_file[0] partition is not necessarily open, you need to use the first
> open partition here (and everywhere)
I have checked the code and m_file[0] always exists. It's used everywhere
all over the code.

As far as I understand, m_file points to the files that where opened, not all
existing ones.

> > +
> >  /**
> >    Clone the open and locked partitioning handler.
> >
> > @@ -11382,6 +11450,12 @@ int ha_partition::end_bulk_delete()
> >  }
> >
> >
> > +bool ha_partition::check_if_updates_are_ignored(const char *op) const
> > +{
> > +  return (handler::check_if_updates_are_ignored(op) ||
> > +          ha_check_if_updates_are_ignored(table->in_use, partition_ht(), op));
>
> no you don't need this virtual method at all, simply do
>
> -  if (ha_check_if_updates_are_ignored(ha_thd(), ht, "DROP"))
> +  if (ha_check_if_updates_are_ignored(ha_thd(), partition_ht(), "DROP"))

Not really, as I also want to check if the partitioned table is shared or not.
I would also have to change all checks everywhere to use partion_ht() and
that would look strange and can easily lead to wrong usage.

It's not up to the caller of ha_check_if_updates_are_ignored() to know
if the table is partitioned
or not.

> that's why partition_ht() was created in the first place.
I use partition_ht() in a lot of places, but I don't think this is the
right place for that.
Better to make this right with a virtual function.

> > +}
> > +
> >  /**
> >    Perform initialization for a direct update request.
> >
> > diff --git a/sql/handler.h b/sql/handler.h
> > index 8c45c64bec8..eadbf28229c 100644
> > --- a/sql/handler.h
> > +++ b/sql/handler.h
> > @@ -3183,7 +3200,10 @@ class handler :public Sql_alloc
> >
> >  public:
> >    virtual void unbind_psi();
> > -  virtual int rebind();
> > +  virtual void rebind_psi();
>
> this doesn't have to be virtual anymore, because you
> moved the check into the virtual discover_check_version()

There are storage engines using rebind_psi().
For that to work, it has to be virtual.

> > diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc
> > index 53ccdb5d4c3..b1756b83056 100644
> > --- a/sql/sql_parse.cc
> > +++ b/sql/sql_parse.cc
> > @@ -5906,6 +5906,8 @@ mysql_execute_command(THD *thd)
> >    case SQLCOM_CALL:
> >    case SQLCOM_REVOKE:
> >    case SQLCOM_GRANT:
> > +    if (thd->variables.option_bits & OPTION_IF_EXISTS)
> > +      lex->create_info.set(DDL_options_st::OPT_IF_EXISTS);
>
> 1. please remove the same (now redundant) code from
>    Sql_cmd_alter_table::execute()

done

> 2. better put it only for the relevant case. Like this
>
>   case SQLCOM_ALTER_TABLE:
>         if (thd->variables.option_bits & OPTION_IF_EXISTS)
>           lex->create_info.set(DDL_options_st::OPT_IF_EXISTS);
>         /* fall through */
>   case SQLCOM_ANALYZE_TABLE:
>         ...

It's not doable that easy.
We share a lot of code between a LOT of SQLCOM_events and splitting these
up is a lot of extra code and doesn't give us anything.

The reason is that if (thd->variables.option_bits & OPTION_IF_EXISTS) is
for this code only set by replication and for commands that support if exists.
This makes it safe to set it for any command.

> > diff --git a/sql/sql_partition.cc b/sql/sql_partition.cc
> > index 7385f9059e1..351464939e2 100644
> > --- a/sql/sql_partition.cc
> > +++ b/sql/sql_partition.cc
> > @@ -7057,6 +7059,10 @@ uint fast_alter_partition_table(THD *thd, TABLE *table,
> >    lpt->pack_frm_data= NULL;
> >    lpt->pack_frm_len= 0;
> >
> > +  /* Add IF EXISTS to binlog if shared table */
> > +  if (table->file->partition_ht()->flags & HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE)
> > +    thd->variables.option_bits|= OPTION_IF_EXISTS;
> > +
>
> No need to, you already have that in mysql_alter_table(). Just fix the check
> there, like this:
>
>   if (!if_exists &&
> -      (table->s->db_type()->flags & HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE))
> +      (table->file->partition_ht()->flags & HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE))

I am already doing that in mysql_alter_table()
However that is not enough as fast_alter_table() doesn't execute the above code.
I tested it by disabling the code in sql_partition.cc and
s3.replication_partition
and it started to fail because of missing IF EXISTS.

> > diff --git a/sql/sql_partition_admin.cc b/sql/sql_partition_admin.cc
> > index ed77c0938f3..d29c014bdc2 100644
> > --- a/sql/sql_partition_admin.cc
> > +++ b/sql/sql_partition_admin.cc
> > @@ -529,14 +530,46 @@ bool Sql_cmd_alter_table_exchange_partition::
> >    table_list->mdl_request.set_type(MDL_SHARED_NO_WRITE);
> >    if (unlikely(open_tables(thd, &table_list, &table_counter, 0,
> >                             &alter_prelocking_strategy)))
> > +  {
> > +    if (thd->lex->if_exists() &&
> > +        thd->get_stmt_da()->sql_errno() == ER_NO_SUCH_TABLE)
> > +    {
> > +      /*
> > +        ALTER TABLE IF EXISTS was used on not existing table
> > +        We have to log the query on a slave as the table may be a shared one
> > +        from the master and we need to ensure that the next slave can see
> > +        the statement as this slave may not have the table shared
> > +      */
> > +      thd->clear_error();
> > +      if (thd->slave_thread &&
> > +          write_bin_log(thd, true, thd->query(), thd->query_length()))
> > +        DBUG_RETURN(true);
> > +      my_ok(thd);
> > +      DBUG_RETURN(false);
> > +    }
>
> I don't like that. There are many Sql_cmd_alter_table* classes and lots of
> duplicated code. Okay, you've put your OPTION_IF_EXISTS check into the big
> switch in the mysql_execute_command() and thus avoided duplicating that code.
>
> But this if() is also in mysql_alter_table() (and may be more?)
> I'm thinking that doing many Sql_cmd_alter_table* classes was a bad idea
> which leads to code duplication. How can we fix that?

The 'if' for doing binlog is only in a fd32 places as far as I can
remember and most of their
code is very different and what they log is different.
Don't have an answer, but don't think there is a lot of code duplication.
There is some extra work to support IF EXITS, but it's largely now taken
care of.

> >      DBUG_RETURN(true);
> > +  }
> >
> >    part_table= table_list->table;
> >    swap_table= swap_table_list->table;
> >
> > +  if (part_table->file->check_if_updates_are_ignored("ALTER"))
> > +  {
> > +    if (thd->slave_thread &&
> > +        write_bin_log_with_if_exists(thd, true, false, true))
> > +      DBUG_RETURN(true);
> > +    my_ok(thd);
> > +    DBUG_RETURN(false);
> > +  }
>
> another duplicated one

Yes, the above two onces we could make into a function.
However there is only two places where we have the above
code!

> > +
> >    if (unlikely(check_exchange_partition(swap_table, part_table)))
> >      DBUG_RETURN(TRUE);
> >
> > +  /* Add IF EXISTS to binlog if shared table */
> > +  if (part_table->file->partition_ht()->flags &
> > +      HTON_TABLE_MAY_NOT_EXIST_ON_SLAVE)
> > +    force_if_exists= 1;
>
> and again

The above is not a duplication of any code.

> > +
> >    /* set lock pruning on first table */
> >    partition_name= alter_info->partition_names.head();
> >    if (unlikely(table_list->table->part_info->
> > @@ -784,16 +821,51 @@ bool Sql_cmd_alter_table_truncate_partition::execute(THD *thd)
> >  #endif /* WITH_WSREP */
> >
> >    if (open_tables(thd, &first_table, &table_counter, 0))
> > -    DBUG_RETURN(true);
> > +  {
> > +    if (thd->lex->if_exists() &&
> > +        thd->get_stmt_da()->sql_errno() == ER_NO_SUCH_TABLE)
> > +    {
> > +      /*
> > +        ALTER TABLE IF EXISTS was used on not existing table
> > +        We have to log the query on a slave as the table may be a shared one
> > +        from the master and we need to ensure that the next slave can see
> > +        the statement as this slave may not have the table shared
> > +      */
> > +      thd->clear_error();
> > +      if (thd->slave_thread &&
> > +          write_bin_log(thd, true, thd->query(), thd->query_length()))
> > +        DBUG_RETURN(TRUE);
> > +      my_ok(thd);
> > +      DBUG_RETURN(FALSE);
> > +    }
> > +    DBUG_RETURN(TRUE);
> > +  }
>
> see?

Yes, we have it in tree places, not really a big deal for a few rows.
However, this is now handled by one function, so a little less code...

> > -  if (!first_table->table || first_table->view ||
> > -      first_table->table->s->db_type() != partition_hton)
> > +  if (!first_table->table || first_table->view)
> >    {
> >      my_error(ER_PARTITION_MGMT_ON_NONPARTITIONED, MYF(0));
> >      DBUG_RETURN(TRUE);
> >    }
> >
> > -
> > +  if (first_table->table->file->check_if_updates_are_ignored("ALTER"))
> > +  {
> > +    if (thd->slave_thread &&
> > +        write_bin_log_with_if_exists(thd, true, false, 1))
> > +      DBUG_RETURN(true);
> > +    my_ok(thd);
> > +    DBUG_RETURN(false);
> > +  }
> > +
> > +  if (first_table->table->s->db_type() != partition_hton)
> > +  {
> > +    my_error(ER_PARTITION_MGMT_ON_NONPARTITIONED, MYF(0));
> > +    DBUG_RETURN(TRUE);
> > +  }
>
> why did you split the check for ER_PARTITION_MGMT_ON_NONPARTITIONED?

To be able to test if updates should be ignored before giving an error
about non_partition_management.

The code is self explanatory.

The reason is that if it's a shared partitioned table, then
(first_table->table->s->db_type() != partition_hton)
is true, but that is not an error.

<cut>

> > diff --git a/sql/table.cc b/sql/table.cc
> > index fe096835144..c6b42b7ba36 100644
> > --- a/sql/table.cc
> > +++ b/sql/table.cc
> > @@ -25,7 +25,8 @@
> >                                                  // primary_key_name
> >  #include "sql_parse.h"                          // free_items
> >  #include "strfunc.h"                            // unhex_type2
> > -#include "sql_partition.h"       // mysql_unpack_partition,
> > +#include "ha_partition.h"        // PART_EXT
> > +                                 // mysql_unpack_partition,
>
> also included below

Good catch. This come from doing a lot of rebases...

<cut>

> > @@ -620,6 +622,9 @@ enum open_frm_error open_table_def(THD *thd, TABLE_SHARE *share, uint flags)
> >    {
> >      DBUG_ASSERT(flags & GTS_TABLE);
> >      DBUG_ASSERT(flags & GTS_USE_DISCOVERY);
> > +    /* Delete .frm and .par files */
> > +    mysql_file_delete_with_symlink(key_file_frm, path, "", MYF(0));
> > +    strxmov(path, share->normalized_path.str, PAR_EXT, NullS);
> >      mysql_file_delete_with_symlink(key_file_frm, path, "", MYF(0));
>
> no need to strxmov, this third "" argument is the file extension, you can directly do
>
>     mysql_file_delete_with_symlink(key_file_frm, share->normalized_path.str, PAR_EXT, MYF(0));

Fixed. Didn't know that the last argument was extension.

> also it should be key_file_partition_ddl_log not key_file_frm
> (it's not a ddl log, but sql_table.cc uses key_file_partition_ddl_log for par files)

Fixed.
<cut>
> >  int TABLE_SHARE::init_from_binary_frm_image(THD *thd, bool write,
> >                                              const uchar *frm_image,
> > -                                            size_t frm_length)
> > +                                            size_t frm_length,
> > +                                            const uchar *par_image,
> > +                                            size_t par_length)
>
> I don't think writing .par file belongs here, why not to do it in the caller?

It would make the code much harder if frm or par file creation failed.
I first had it in the caller, but the error handling got so complex so it
was much easier to do it here.

> Or, better, don't send .par file to S3 and don't discover it, it'll make
> everything much simpler. .par can be created from the .frm, see sql_table.cc:

I don't think that's simpler.  If it's simpler, then
init_from_binary_frm() should
automatically notice that the frm contains a par file and create one.
I don't know how par files works and I don't really want to know, as
the future plan apparently is to embed that into the frm.

So this I leave for a rainy day when we don't have .par files anymore

> >  {
> >    TABLE_SHARE *share= this;
> >    uint new_frm_ver, field_pack_length, new_field_pack_flag;
> > @@ -1715,24 +1724,31 @@ int TABLE_SHARE::init_from_binary_frm_image(THD *thd, bool write,
> >    uint len;
> >    uint ext_key_parts= 0;
> >    plugin_ref se_plugin= 0;
> > -  bool vers_can_native= false;
> > +  bool vers_can_native= false, frm_created= 0;
> >    Field_data_type_info_array field_data_type_info_array;
> > -
> >    MEM_ROOT *old_root= thd->mem_root;
> >    Virtual_column_info **table_check_constraints;
> >    extra2_fields extra2;
> > -
> >    DBUG_ENTER("TABLE_SHARE::init_from_binary_frm_image");
> >
> >    keyinfo= &first_keyinfo;
> >    thd->mem_root= &share->mem_root;
> >
> > -  if (write && write_frm_image(frm_image, frm_length))
> > -    goto err;
> > -
> >    if (frm_length < FRM_HEADER_SIZE + FRM_FORMINFO_SIZE)
> >      goto err;
> >
> > +  if (write)
> > +  {
> > +#ifdef WITH_PARTITION_STORAGE_ENGINE
> > +    if (par_image)
> > +      if (write_par_image(par_image, par_length))
> > +        goto err;
> > +#endif
> > +    frm_created= 1;
> > +    if (write_frm_image(frm_image, frm_length))
> > +      goto err;
> > +  }
>
> why not to create it at the end? then you won't need to delete it on an error.

I have now moved it last. I normally prefer to do all related things
in one place
and it was logical to create .frm and .par file at the same place

<cut>

> > index ed9c5404a7f..1f99fd9d895 100644
> > --- a/storage/maria/ha_maria.cc
> > +++ b/storage/maria/ha_maria.cc
> > @@ -2724,7 +2724,7 @@ void ha_maria::drop_table(const char *name)
> >  {
> >    DBUG_ASSERT(file->s->temporary);
> >    (void) ha_close();
> > -  (void) maria_delete_table_files(name, 1, 0);
> > +  (void) maria_delete_table_files(name, 1, MY_WME);
>
> MYF(MY_WME) as always?

In this code the table is open, and thus the files should always exists.
Better to give an error if something goes wrong.
(I had some issue while testing that I didn't get errors for missing
files which caused
a lot of confusion).

Regards,
Monty


References