← Back to team overview

maria-developers team mailing list archive

Re: 2f9bbf392072: MDEV-18734 optimization by taking ownership via String::swap()

 

Hi, Aleksey!

Don't be confused, the commit header and the comment come from your last
commit, but the diff below includes all three commits that mention MDEV-18734.

On Nov 02, Aleksey Midenkov wrote:
> revision-id: 2f9bbf392072 (mariadb-10.2.31-543-g2f9bbf392072)
> parent(s): 1f4960e3f2ac
> author: Aleksey Midenkov <midenok@xxxxxxxxx>
> committer: Aleksey Midenkov <midenok@xxxxxxxxx>
> timestamp: 2020-11-02 14:26:04 +0300
> message:
> 
> MDEV-18734 optimization by taking ownership via String::swap()
> 
> ha_partition stores records in array of m_ordered_rec_buffer and uses
> it for prio queue in ordered index scan. When the records are restored
> from the array the blob buffers may be already freed or
> rewritten. This can happen when blob buffers are cached in
> Field_blob::value or Field_blob::read_value.
> 
> Previous solution worked via copying blob buffer into
> mem_root-allocated buffer.  That was not optimal as it requires more
> memory and processing time.
> 
> This patch avoids memory copy by taking temporary ownership of cached
> blob buffers via String::swap().  It also fixes problem when
> value.str_length is wrong after Field_blob::store().


> diff --git a/mysql-test/suite/federated/federated_partition.test b/mysql-test/suite/federated/federated_partition.test
> index ef1e27ec5054..33ee025442f4 100644
> --- a/mysql-test/suite/federated/federated_partition.test
> +++ b/mysql-test/suite/federated/federated_partition.test
> @@ -50,4 +50,30 @@ drop table federated.t1_2;
>  
>  --echo End of 5.1 tests
>  
> +--echo #
> +--echo # MDEV-18734 ASAN heap-use-after-free upon sorting by blob column from partitioned table
> +--echo #
> +connection slave;
> +use federated;
> +create table t1_1 (x int, b text, key(x));
> +create table t1_2 (x int, b text, key(x));
> +connection master;
> +--replace_result $SLAVE_MYPORT SLAVE_PORT
> +eval create table t1 (x int, b text, key(x)) engine=federated
> +  partition by range columns (x) (
> +  partition p1 values less than (40) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_1',
> +  partition pn values less than (maxvalue) connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_2'
> +);
> +insert t1 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8);
> +insert t1 select x + 8, x + 8 from t1;
> +insert t1 select x + 16, x + 16 from t1;
> +insert t1 select x + 49, repeat(x + 49, 100) from t1;
> +--echo # This produces wrong result before MDEV-17573
> +select * from t1;

What do you mean, simple `select * from t1` produces incorrect result?

> +flush tables;
> +select x, b from t1 where x > 30 and x < 60 order by b;

Is this ASAN-only test?

> +drop table t1;
> +connection slave;
> +drop table t1_1, t1_2;
> +
>  source include/federated_cleanup.inc;
> diff --git a/mysql-test/suite/vcol/t/partition.test b/mysql-test/suite/vcol/t/partition.test
> index 889724fb1c5c..f10ee0491ccc 100644
> --- a/mysql-test/suite/vcol/t/partition.test
> +++ b/mysql-test/suite/vcol/t/partition.test
> @@ -30,3 +30,28 @@ subpartition by hash(v) subpartitions 3 (
>  insert t1 set i= 0;
>  set statement sql_mode= '' for update t1 set i= 1, v= 2;
>  drop table t1;
> +
> +--echo #
> +--echo # MDEV-18734 ASAN heap-use-after-free in my_strnxfrm_simple_internal upon update on versioned partitioned table
> +--echo #
> +--echo # Cover queue_fix() in ha_partition::handle_ordered_index_scan()
> +create table t1 (
> +    x int auto_increment primary key,
> +    b text, v mediumtext as (b) virtual,
> +    index (v(10))
> +) partition by range columns (x) (
> +    partition p1 values less than (3),
> +    partition p2 values less than (6),
> +    partition pn values less than (maxvalue));
> +insert into t1 (b) values ('q'), ('a'), ('b');
> +update t1 set b= 'bar' where v > 'a';
> +drop table t1;

This test didn't fail for me in an ASAN build
on the vanilla 10.5 without your fix.
That is, I'm not sure it actually tests the fix.

> +
> +--echo # Cover return_top_record() in ha_partition::handle_ordered_index_scan()
> +create table t1 (x int primary key, b tinytext, v text as (b) virtual)
> +partition by range columns (x) (
> +  partition p1 values less than (4),
> +  partition pn values less than (maxvalue));
> +insert into t1 (x, b) values (1, ''), (2, ''), (3, 'a'), (4, 'b');
> +update t1 set b= 'bar' where x > 0 order by v limit 2;

This fails without a fix all right.

Still, I wonder whether it's possible to create a test that'll fail
in a normal optimized build w/o ASAN

> +drop table t1;
> diff --git a/sql/field.cc b/sql/field.cc
> index bdaaecc20269..0fd40c979d2c 100644
> --- a/sql/field.cc
> +++ b/sql/field.cc
> @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs)
>    copy_length= copier.well_formed_copy(field_charset,
>                                         (char*) value.ptr(), new_length,
>                                         cs, from, length);
> +  value.length(copy_length);

good! Could this be a distinct bug with its own test case?

>    Field_blob::store_length(copy_length);
>    bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
>  
> diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
> index 9d82df0235c7..7c658082d397 100644
> --- a/sql/ha_partition.cc
> +++ b/sql/ha_partition.cc
> @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue()
>    uint alloc_len;
>    uint used_parts= bitmap_bits_set(&m_part_info->read_partitions);
>    /* Allocate record buffer for each used partition. */
> -  m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS;
> +  m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET;
>    if (!m_using_extended_keys)
>        m_priority_queue_rec_len += m_file[0]->ref_length;
>    alloc_len= used_parts * m_priority_queue_rec_len;
>    /* Allocate a key for temporary use when setting up the scan. */
>    alloc_len+= table_share->max_key_length;
>  
> -  if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME))))
> +  if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))

I don't see why you need a memroot here. memroot is needed when you
allocate later at difeerent points in time some initially unpredictable
amount of memory that has the same lifetime and needs to be
freed all at once.

But here it seems that you only allocate once. You could just make

   m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS +
              table->s->blob_fields * sizeof(Ordered_blob_storage)

(plus necessary alignment).

>      DBUG_RETURN(true);
>  
>    /*
> @@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
>        */
>        error= file->read_range_first(m_start_key.key? &m_start_key: NULL,
>                                      end_range, eq_range, TRUE);
> -      memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> +      if (!error)
> +      {
> +        memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> +      }

did you have a bug because of that? something didn't work?

> +
>        reverse_order= FALSE;
>        break;
>      }
> @@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found()
>  }
>  
>  
> +void ha_partition::swap_blobs(uchar * rec_buf, Ordered_blob_storage ** storage, bool restore)
> +{
> +  uint *ptr, *end;
> +  uint blob_n= 0;
> +  table->move_fields(table->field, rec_buf, table->record[0]);
> +  for (ptr= table->s->blob_field, end= ptr + table->s->blob_fields;
> +       ptr != end; ++ptr, ++blob_n)
> +  {
> +    DBUG_ASSERT(*ptr < table->s->fields);
> +    Field_blob *blob= (Field_blob*) table->field[*ptr];
> +    DBUG_ASSERT(blob->flags & BLOB_FLAG);
> +    DBUG_ASSERT(blob->field_index == *ptr);
> +    if (!bitmap_is_set(table->read_set, *ptr) || blob->is_null())
> +      continue;
> +
> +    Ordered_blob_storage &s= *storage[blob_n];
> +
> +    if (restore)
> +    {
> +      if (!s.blob.is_empty())
> +        blob->swap(s.blob, s.set_read_value);

looks somewhat risky, why blob.length == 0 means that you don't
need to restore the value?

> +    }
> +    else
> +    {
> +      bool set_read_value;
> +      String *cached= blob->cached(set_read_value);
> +      if (cached)
> +      {
> +        cached->swap(s.blob);
> +        s.set_read_value= set_read_value;

is it indeed possible here for a value to be either in `value` or in
`read_value` ?

When happens what?

> +      }
> +    }
> +  }
> +  table->move_fields(table->field, table->record[0], rec_buf);
> +}
> +
> +
>  /*
>    Common routine to handle index_next with ordered results
>  

Regards,
Sergei
VP of MariaDB Server Engineering
and security@xxxxxxxxxxx


Follow ups