← Back to team overview

maria-developers team mailing list archive

Re: 2f9bbf392072: MDEV-18734 optimization by taking ownership via String::swap()

 

Sergei,

the updated patches are here:

https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28b72f1a

On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
> Hi, Aleksey!
>
> Don't be confused, the commit header and the comment come from your last
> commit, but the diff below includes all three commits that mention
MDEV-18734.
>
> On Nov 02, Aleksey Midenkov wrote:
> > revision-id: 2f9bbf392072 (mariadb-10.2.31-543-g2f9bbf392072)
> > parent(s): 1f4960e3f2ac
> > author: Aleksey Midenkov <midenok@xxxxxxxxx>
> > committer: Aleksey Midenkov <midenok@xxxxxxxxx>
> > timestamp: 2020-11-02 14:26:04 +0300
> > message:
> >
> > MDEV-18734 optimization by taking ownership via String::swap()
> >
> > ha_partition stores records in array of m_ordered_rec_buffer and uses
> > it for prio queue in ordered index scan. When the records are restored
> > from the array the blob buffers may be already freed or
> > rewritten. This can happen when blob buffers are cached in
> > Field_blob::value or Field_blob::read_value.
> >
> > Previous solution worked via copying blob buffer into
> > mem_root-allocated buffer.  That was not optimal as it requires more
> > memory and processing time.
> >
> > This patch avoids memory copy by taking temporary ownership of cached
> > blob buffers via String::swap().  It also fixes problem when
> > value.str_length is wrong after Field_blob::store().
>
>
> > diff --git a/mysql-test/suite/federated/federated_partition.test
b/mysql-test/suite/federated/federated_partition.test
> > index ef1e27ec5054..33ee025442f4 100644
> > --- a/mysql-test/suite/federated/federated_partition.test
> > +++ b/mysql-test/suite/federated/federated_partition.test
> > @@ -50,4 +50,30 @@ drop table federated.t1_2;
> >
> >  --echo End of 5.1 tests
> >
> > +--echo #
> > +--echo # MDEV-18734 ASAN heap-use-after-free upon sorting by blob
column from partitioned table
> > +--echo #
> > +connection slave;
> > +use federated;
> > +create table t1_1 (x int, b text, key(x));
> > +create table t1_2 (x int, b text, key(x));
> > +connection master;
> > +--replace_result $SLAVE_MYPORT SLAVE_PORT
> > +eval create table t1 (x int, b text, key(x)) engine=federated
> > +  partition by range columns (x) (
> > +  partition p1 values less than (40)
connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_1',
> > +  partition pn values less than (maxvalue)
connection='mysql://root@127.0.0.1:$SLAVE_MYPORT/federated/t1_2'
> > +);
> > +insert t1 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7,
7), (8, 8);
> > +insert t1 select x + 8, x + 8 from t1;
> > +insert t1 select x + 16, x + 16 from t1;
> > +insert t1 select x + 49, repeat(x + 49, 100) from t1;
> > +--echo # This produces wrong result before MDEV-17573
> > +select * from t1;
>
> What do you mean, simple `select * from t1` produces incorrect result?

Removed that garbage line.

>
> > +flush tables;
> > +select x, b from t1 where x > 30 and x < 60 order by b;
>
> Is this ASAN-only test?

No, it produces wrong result in 10.2 for me.

>
> > +drop table t1;
> > +connection slave;
> > +drop table t1_1, t1_2;
> > +
> >  source include/federated_cleanup.inc;
> > diff --git a/mysql-test/suite/vcol/t/partition.test
b/mysql-test/suite/vcol/t/partition.test
> > index 889724fb1c5c..f10ee0491ccc 100644
> > --- a/mysql-test/suite/vcol/t/partition.test
> > +++ b/mysql-test/suite/vcol/t/partition.test
> > @@ -30,3 +30,28 @@ subpartition by hash(v) subpartitions 3 (
> >  insert t1 set i= 0;
> >  set statement sql_mode= '' for update t1 set i= 1, v= 2;
> >  drop table t1;
> > +
> > +--echo #
> > +--echo # MDEV-18734 ASAN heap-use-after-free in
my_strnxfrm_simple_internal upon update on versioned partitioned table
> > +--echo #
> > +--echo # Cover queue_fix() in ha_partition::handle_ordered_index_scan()
> > +create table t1 (
> > +    x int auto_increment primary key,
> > +    b text, v mediumtext as (b) virtual,
> > +    index (v(10))
> > +) partition by range columns (x) (
> > +    partition p1 values less than (3),
> > +    partition p2 values less than (6),
> > +    partition pn values less than (maxvalue));
> > +insert into t1 (b) values ('q'), ('a'), ('b');
> > +update t1 set b= 'bar' where v > 'a';
> > +drop table t1;
>
> This test didn't fail for me in an ASAN build
> on the vanilla 10.5 without your fix.
> That is, I'm not sure it actually tests the fix.

It failed on 10.3 for me some time ago. Ok, replaced that with the bigger
version which fails on non-ASAN optimized all right.

>
> > +
> > +--echo # Cover return_top_record() in
ha_partition::handle_ordered_index_scan()
> > +create table t1 (x int primary key, b tinytext, v text as (b) virtual)
> > +partition by range columns (x) (
> > +  partition p1 values less than (4),
> > +  partition pn values less than (maxvalue));
> > +insert into t1 (x, b) values (1, ''), (2, ''), (3, 'a'), (4, 'b');
> > +update t1 set b= 'bar' where x > 0 order by v limit 2;
>
> This fails without a fix all right.
>
> Still, I wonder whether it's possible to create a test that'll fail
> in a normal optimized build w/o ASAN

The new case fails on non-ASAN optimized as well.

>
> > +drop table t1;
> > diff --git a/sql/field.cc b/sql/field.cc
> > index bdaaecc20269..0fd40c979d2c 100644
> > --- a/sql/field.cc
> > +++ b/sql/field.cc
> > @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint
length,CHARSET_INFO *cs)
> >    copy_length= copier.well_formed_copy(field_charset,
> >                                         (char*) value.ptr(), new_length,
> >                                         cs, from, length);
> > +  value.length(copy_length);
>
> good! Could this be a distinct bug with its own test case?

Do we need to spend time on that bureaucracy? There are real problems to
solve... Tickets drain time from the management people and generally from
anyone who sees them.

>
> >    Field_blob::store_length(copy_length);
> >    bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
> >
> > diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
> > index 9d82df0235c7..7c658082d397 100644
> > --- a/sql/ha_partition.cc
> > +++ b/sql/ha_partition.cc
> > @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue()
> >    uint alloc_len;
> >    uint used_parts= bitmap_bits_set(&m_part_info->read_partitions);
> >    /* Allocate record buffer for each used partition. */
> > -  m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS;
> > +  m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET;
> >    if (!m_using_extended_keys)
> >        m_priority_queue_rec_len += m_file[0]->ref_length;
> >    alloc_len= used_parts * m_priority_queue_rec_len;
> >    /* Allocate a key for temporary use when setting up the scan. */
> >    alloc_len+= table_share->max_key_length;
> >
> > -  if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len,
MYF(MY_WME))))
> > +  if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root,
alloc_len)))
>
> I don't see why you need a memroot here. memroot is needed when you
> allocate later at difeerent points in time some initially unpredictable
> amount of memory that has the same lifetime and needs to be
> freed all at once.

I know... This is just another convenient way of allocation with presumably
minimized number of mallocs (probably pre_alloc_size should be tweaked to a
better value).

>
> But here it seems that you only allocate once. You could just make
>
>    m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS +
>               table->s->blob_fields * sizeof(Ordered_blob_storage)

I don't wnat this data serialized. It is inconvenient to read the dump of
the record. Why we should make it harder to code, read the code and debug
while we can make it easier? If it is only for guaranteed 1 malloc, I don't
think it's worth it.

>
> (plus necessary alignment).
>
> >      DBUG_RETURN(true);
> >
> >    /*
> > @@ -6178,7 +6203,11 @@ int
ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
> >        */
> >        error= file->read_range_first(m_start_key.key? &m_start_key:
NULL,
> >                                      end_range, eq_range, TRUE);
> > -      memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > +      if (!error)
> > +      {
> > +        memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > +      }
>
> did you have a bug because of that? something didn't work?

No, I just saw worthless memcpy and avoided it.

>
> > +
> >        reverse_order= FALSE;
> >        break;
> >      }
> > @@ -6310,6 +6349,43 @@ int
ha_partition::handle_ordered_index_scan_key_not_found()
> >  }
> >
> >
> > +void ha_partition::swap_blobs(uchar * rec_buf, Ordered_blob_storage **
storage, bool restore)
> > +{
> > +  uint *ptr, *end;
> > +  uint blob_n= 0;
> > +  table->move_fields(table->field, rec_buf, table->record[0]);
> > +  for (ptr= table->s->blob_field, end= ptr + table->s->blob_fields;
> > +       ptr != end; ++ptr, ++blob_n)
> > +  {
> > +    DBUG_ASSERT(*ptr < table->s->fields);
> > +    Field_blob *blob= (Field_blob*) table->field[*ptr];
> > +    DBUG_ASSERT(blob->flags & BLOB_FLAG);
> > +    DBUG_ASSERT(blob->field_index == *ptr);
> > +    if (!bitmap_is_set(table->read_set, *ptr) || blob->is_null())
> > +      continue;
> > +
> > +    Ordered_blob_storage &s= *storage[blob_n];
> > +
> > +    if (restore)
> > +    {
> > +      if (!s.blob.is_empty())
> > +        blob->swap(s.blob, s.set_read_value);
>
> looks somewhat risky, why blob.length == 0 means that you don't
> need to restore the value?

We store only non-empty blobs therefore we restore non-empty blobs. We can
swap empty blobs too but why?

>
> > +    }
> > +    else
> > +    {
> > +      bool set_read_value;
> > +      String *cached= blob->cached(set_read_value);
> > +      if (cached)
> > +      {
> > +        cached->swap(s.blob);
> > +        s.set_read_value= set_read_value;
>
> is it indeed possible here for a value to be either in `value` or in
> `read_value` ?
>
> When happens what?

Yes, it uses `value` for plain fields and `read_value` for virtual fields.

>
> > +      }
> > +    }
> > +  }
> > +  table->move_fields(table->field, table->record[0], rec_buf);
> > +}
> > +
> > +
> >  /*
> >    Common routine to handle index_next with ordered results
> >
>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx



--
All the best,

Aleksey Midenkov
@midenok

Follow ups

References