
maria-developers team mailing list archive

Re: 2f9bbf392072: MDEV-18734 optimization by taking ownership via String::swap()

 

Hi Sergei,

On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
> Hi, Aleksey!
>
> On Nov 03, Aleksey Midenkov wrote:
> >
> > the updated patches are here:
> >
> > https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28b72f1a
>
> I started writing my comments there, but realized that it'd be clearer
> if I reply here, with more context. That is, see below.
>
> One comment about the new test case: you have 16K lines in the result
> file; are all 16K necessary? Can you do something like
>
>    select x, left(b,10), left(v,10)
>
> ?

Yes, the bug reproduces with the shorter result file as well. Updated.

>
> And please, pretty please, don't use non-constant references as function
> arguments.

Why? The only thing pointers add is the ability to pass NULL, and we
don't need NULL there, do we?
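
To be sure we mean the same trade-off, here is a sketch (the helper name
and signatures are made up for illustration):

    /* pointer style: the caller may pass NULL, the callee must check */
    void fetch_blob_value(String *dst, const Field_blob *field);

    /* reference style: NULL is impossible, the contract is in the signature */
    void fetch_blob_value(String &dst, const Field_blob &field);

With references the "never NULL" guarantee is part of the signature, and
that is exactly the situation we have here.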

>
> > On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
> > >
> > > > diff --git a/sql/field.cc b/sql/field.cc
> > > > index bdaaecc20269..0fd40c979d2c 100644
> > > > --- a/sql/field.cc
> > > > +++ b/sql/field.cc
> > > > @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs)
> > > >    copy_length= copier.well_formed_copy(field_charset,
> > > >                                         (char*) value.ptr(), new_length,
> > > >                                         cs, from, length);
> > > > +  value.length(copy_length);
> > >
> > > good! Could this be a distinct bug with its own test case?
> >
> > Do we need to spend time on that bureaucracy? There are real problems to
> > solve... Tickets drain time from management and generally from anyone
> > who sees them.
>
> I didn't mean opening an MDEV. Only writing a test case and extracting
> it into a separate commit.
>
> If this is a genuine bug fix and you don't create a test case - how can
> you be sure the fix will stay? It may disappear in a merge or a rebase,
> manually or automatically. Or in a refactoring that "looks fine, all
> tests pass". If you can create a test case - please, do.

It is tested by the federated.federated_partition test case. Why do we
have to extract this side-fix into a separate commit? That still looks
like bureaucracy to me. The main bug is critical, and the test case
covers both bugs nicely. Even if the partitioning code is refactored,
the test case will still cover that side-fix.

>
> > > >    Field_blob::store_length(copy_length);
> > > >    bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
> > > >
> > > > diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
> > > > index 9d82df0235c7..7c658082d397 100644
> > > > --- a/sql/ha_partition.cc
> > > > +++ b/sql/ha_partition.cc
> > > > @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue()
> > > >    uint alloc_len;
> > > >    uint used_parts= bitmap_bits_set(&m_part_info->read_partitions);
> > > >    /* Allocate record buffer for each used partition. */
> > > > -  m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS;
> > > > +  m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET;
> > > >    if (!m_using_extended_keys)
> > > >        m_priority_queue_rec_len += m_file[0]->ref_length;
> > > >    alloc_len= used_parts * m_priority_queue_rec_len;
> > > >    /* Allocate a key for temporary use when setting up the scan. */
> > > >    alloc_len+= table_share->max_key_length;
> > > >
> > > > -  if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME))))
> > > > +  if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
> > >
> > > I don't see why you need a memroot here. memroot is needed when you
> > > later allocate, at different points in time, some initially unpredictable
> > > amount of memory that has the same lifetime and needs to be
> > > freed all at once.
> >
> > I know... This is just another convenient way to allocate, with a presumably
> > minimized number of mallocs (probably pre_alloc_size should be tweaked to a
> > better value).
>
> memroot wasn't created for that. There are approaches with much less
> overhead.

Using memory pools like that is a nice design pattern for non-critical
sections. At least it is nicer than manual pointer management. I
believe I've seen a similar pattern in InnoDB with its memory pool
mem_heap_t, even without preallocation! It would be nice to have a
special class that preallocates and manages pointers, but I do not see
a reason to develop that for this specific use case.
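
To spell out the pattern I mean (just a sketch: m_ordered_root is the
MEM_ROOT the patch adds to ha_partition, and the exact init_alloc_root()
argument list varies between server versions):

    /* one pool for everything the ordered index scan needs */
    init_alloc_root(&m_ordered_root, "ha_partition", 8192, 0, MYF(0));

    /* any number of allocations later, no per-pointer bookkeeping */
    m_ordered_rec_buffer= (uchar *) alloc_root(&m_ordered_root, alloc_len);
    Ordered_blob_storage *blob= new (&m_ordered_root) Ordered_blob_storage();

    /* everything is released by a single call when the scan goes away */
    free_root(&m_ordered_root, MYF(0));

The record buffer and all the per-partition blob storages have exactly
the same lifetime, which is the case mem_root handles well.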

I don't believe there is *much* overhead: mem_root is the simplest of
algorithms, a couple of ifs and one loop. The initialization call
init_record_priority_queue() is not on the critical path; the execution
time is mostly spent copying data in handle_ordered_index_scan() and
handle_ordered_next().

Let's see the gprof stats collected for 100k SELECTs, attached below as
stats.txt.
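
(For reference, the profile was collected the usual gprof way, roughly:
build with -pg, run the 100k SELECTs, then post-process with something
like

    gprof bin/mysqld gmon.out > stats.txt

where the binary path is illustrative.)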

init_record_priority_queue() takes 0.3%; handle_ordered_index_scan()
takes 2.3%; handle_ordered_next() takes 12.3%.

Let's see the unpatched stats for the same 100k SELECTs, attached as
stats_vanilla.txt.

init_record_priority_queue() takes 0.1%; handle_ordered_index_scan()
takes 2.2%; handle_ordered_next() takes 11.4%.

My version of init_record_priority_queue() lost 0.2%. From the detailed
breakdown we can see that mem_root took 6 out of 13, which is 0.15%.

Keep in mind that this is an artificial usage of the priority queue. In
a diversified real-world workload the share of this kind of SELECT is,
say, 0.1%. But even if it is 1%, we divide the stats above by 100. So,
do you really care about 0.0015%?

>
> > >
> > > But here it seems that you only allocate once. You could just make
> > >
> > >    m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS +
> > >               table->s->blob_fields * sizeof(Ordered_blob_storage)
> >
> > I don't want this data serialized. It is inconvenient to read the dump of
> > the record. Why should we make the code harder to write, read and debug
> > when we can make it easier? If it is only to guarantee a single malloc, I
> > don't think it's worth it.
>
> What do you mean "serialized"?

A memory layout where the data is placed into a single contiguous
buffer.

> What do you mean "read the dump"?

The x command in GDB.
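
E.g. with everything packed into m_ordered_rec_buffer you end up doing
the offset arithmetic by hand (the offset below is illustrative):

    (gdb) x/64xb m_ordered_rec_buffer
    (gdb) p *(Ordered_blob_storage *) (m_ordered_rec_buffer + <offset computed by hand>)

whereas a separately allocated object is just "p *blob_storage".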

>
> Note that you can also use my_multi_malloc, which allocates many buffers
> together and hides the address arithmetic from the caller. Like
>
>   my_multi_malloc(MYF(MY_WME),
>      &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS,
>      &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage),
>      NULL)
>
> (but with error checking). And then, perhaps, with placement new[]
> you could construct all Ordered_blob_storage objects in one line.

Sorry, I don't understand how my version is much worse.
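
Just so we agree on what is being compared, I read the suggestion as
roughly the following (a sketch only, mirroring your example, error
handling trimmed):

    uchar *rec_buf;
    Ordered_blob_storage *blobs;
    if (!my_multi_malloc(MYF(MY_WME),
                         &rec_buf, alloc_len,
                         &blobs, (uint) (table->s->blob_fields *
                                         sizeof(Ordered_blob_storage)),
                         NullS))
      DBUG_RETURN(TRUE);
    for (uint i= 0; i < table->s->blob_fields; i++)
      new (blobs + i) Ordered_blob_storage(); /* placement new, no extra malloc */

I see that it saves the extra mallocs; my point is only that the
difference is negligible here (see the numbers above).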

>
> > > > @@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
> > > >        */
> > > >        error= file->read_range_first(m_start_key.key? &m_start_key: NULL,
> > > >                                      end_range, eq_range, TRUE);
> > > > -      memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > > > +      if (!error)
> > > > +      {
> > > > +        memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > > > +      }
> > >
> > > did you have a bug because of that? something didn't work?
> >
> > No, I just saw worthless memcpy and avoided it.
>
> You're right that the memcpy is unnecessary in case of an error.
> But most of the time there is no error. You made an exceptional code path
> faster by making a common code path slower. I think it's generally
> better to do the opposite: make the common, no-error code path faster
> (fewer conditionals, in particular), at the cost of making the error
> path slower, if needed.

I'm usually aware of such trade-offs when making decisions. The added
check costs close to 0, not even 0.0001%; it is much lower. OTOH the
memcpy() cost is much higher, and this is not a fully exceptional code
path: HA_ERR_KEY_NOT_FOUND continues the loop.

>
> > > > @@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found()
> > > > +      if (cached)
> > > > +      {
> > > > +        cached->swap(s.blob);
> > > > +        s.set_read_value= set_read_value;
> > >
> > > is it indeed possible here for a value to be either in `value` or in
> > > `read_value` ?
> > >
> > > When happens what?
> >
> > Yes, it uses `value` for plain fields and `read_value` for virtual fields.
>
> Oh, I see.
>
> This read_value was introduced in ea1b25046c81db8fdf to solve the case
> of UPDATE, where a row is moved with store_record(table, record[1]) and
> blob pointers aren't updated. This is very similar to what you're
> fixing. Perhaps read_value should go away and both bugs should get one
> unified solution that allows one to "park" a record somewhere outside of
> record[0] and restore it later, all with proper blob handling?

Do you think this is necessary now? I believe there are a lot of
*much* more important tasks.
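
(For context on the mechanism itself: the "parking" in this patch is
nothing more than String::swap(), which roughly just exchanges the
internal pointer, lengths and the "alloced" flag, so it is O(1) for a
blob of any size:

    String parked;
    parked.swap(blob_value);   /* blob_value gives up its buffer, no copy */
    ...                        /* record[0] may now be overwritten freely */
    parked.swap(blob_value);   /* buffer handed back before returning the row */

Here blob_value stands for either Field_blob::value or read_value,
whichever currently owns the data.)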

>
> > >
> > > > +      }
> > > > +    }
> > > > +  }
> > > > +  table->move_fields(table->field, table->record[0], rec_buf);
> > > > +}
>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx


-- 
All the best,

Aleksey Midenkov
@midenok

stats.txt (patched server):
                0.02    0.13  100000/100000      ha_partition::index_init(unsigned int, bool) <cycle 3> [262]
[289]    0.3    0.02    0.13  100000         ha_partition::init_record_priority_queue() [289]
                0.00    0.04  800000/5602760     Sql_alloc::operator new(unsigned long, st_mem_root*) [176]
                0.00    0.02  500000/15013727     alloc_root [109]
                0.00    0.02  800000/800000      Ordered_blob_storage::Ordered_blob_storage() [883]
                0.00    0.01  100000/100000      init_prealloc_root [1101]
                0.00    0.01  100000/100003      init_queue [1120]
                0.00    0.01  400000/10400882     bitmap_get_next_set [225]
                0.00    0.00  100000/1700086     bitmap_get_first_set [691]
                0.00    0.00  400000/218144857     _db_pargs_ [86]
                0.00    0.00  100000/237505826     _db_return_ [42]
                0.00    0.00  100000/237505806     _db_enter_ [43]
                0.00    0.00  100000/200000      bitmap_bits_set [4743]

                0.05    1.12  100000/100000      ha_partition::common_index_read(unsigned char*, bool) [82]
[87]     2.3    0.05    1.12  100000         ha_partition::handle_ordered_index_scan(unsigned char*, bool) [87]
                0.02    1.04  400000/400000      handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool) [94]
                0.01    0.02  300000/5800000     ha_partition::swap_blobs(unsigned char*, Ordered_blob_storage**, bool) [125]
                0.00    0.01  100000/2900000     ha_partition::return_top_record(unsigned char*) [195]
                0.00    0.01  399999/10400882     bitmap_get_next_set [225]
                0.00    0.00  299999/5799999     ha_myisam::position(unsigned char const*) [456]
                0.00    0.00  100000/100000      queue_fix [1668]
                0.00    0.00  600000/218144857     _db_pargs_ [86]
                0.00    0.00  100000/1700086     bitmap_get_first_set [691]
                0.00    0.00  100000/237505826     _db_return_ [42]
                0.00    0.00  100000/237505806     _db_enter_ [43]
                0.00    0.00  100000/20700271     bitmap_is_set [766]

                0.06    6.19 2899999/2899999     ha_partition::read_range_next() [26]
[27]    12.3    0.06    6.19 2899999         ha_partition::handle_ordered_next(unsigned char*, bool) [27]
                0.10    5.26 2900000/2900000     handler::read_range_next() [36]
                0.02    0.26 2800000/2900000     ha_partition::return_top_record(unsigned char*) [195]
                0.07    0.17 2600000/5800000     ha_partition::swap_blobs(unsigned char*, Ordered_blob_storage**, bool) [125]
                0.07    0.03 2600000/2700000     _downheap [397]
                0.03    0.04  300000/300000      queue_remove [486]
                0.02    0.03 2900000/237505826     _db_return_ [42]
                0.02    0.03 2899999/237505806     _db_enter_ [43]
                0.02    0.02 2600000/5799999     ha_myisam::position(unsigned char const*) [456]
                0.01    0.01 2800000/218144857     _db_pargs_ [86]

stats_vanilla.txt (unpatched server):

                0.01    0.04  100000/100000      ha_partition::index_init(unsigned int, bool) <cycle 3> [417]
[544]    0.1    0.01    0.04  100000         ha_partition::init_record_priority_queue() [544]
                0.01    0.01  100000/100003      init_queue [853]
                0.01    0.00  400000/10000879     bitmap_get_next_set [246]
                0.00    0.01  100000/4010970     my_malloc [187]
                0.00    0.00  400000/215344917     _db_pargs_ [84]
                0.00    0.00  100000/236005936     _db_enter_ [42]
                0.00    0.00  100000/236005953     _db_return_ [45]
                0.00    0.00  100000/1600086     bitmap_get_first_set [847]
                0.00    0.00  100000/200000      bitmap_bits_set

                0.03    1.06  100000/100000      ha_partition::common_index_read(unsigned char*, bool) [90]
[92]     2.2    0.03    1.06  100000         ha_partition::handle_ordered_index_scan(unsigned char*, bool) [92]
                0.00    1.04  400000/400000      handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool) [98]
                0.01    0.00  400000/10000879     bitmap_get_next_set [246]
                0.00    0.00  300000/5800000     ha_myisam::position(unsigned char const*) [416]
                0.00    0.00  600000/215344917     _db_pargs_ [84]
                0.00    0.00  100000/100000      queue_fix [1724]
                0.00    0.00  100000/236005936     _db_enter_ [42]
                0.00    0.00  100000/236005953     _db_return_ [45]
                0.00    0.00  100000/1600086     bitmap_get_first_set [847]
                0.00    0.00  100000/2900000     ha_partition::return_top_record(unsigned char*) [1207]
                0.00    0.00  100000/9100265     bitmap_is_set [870]

                0.13    5.46 2900000/2900000     ha_partition::read_range_next() [29]
[30]    11.4    0.13    5.46 2900000         ha_partition::handle_ordered_next(unsigned char*, bool) [30]
                0.05    5.19 2900000/2900000     handler::read_range_next() [35]
                0.03    0.03 2600000/2700000     _downheap [516]
                0.02    0.03 2900000/236005936     _db_enter_ [42]
                0.01    0.03 2900000/236005953     _db_return_ [45]
                0.00    0.03 2600000/5800000     ha_myisam::position(unsigned char const*) [416]
                0.01    0.01  300000/300000      queue_remove [834]
                0.01    0.01 2800000/215344917     _db_pargs_ [84]
                0.01    0.00 2800000/2900000     ha_partition::return_top_record(unsigned char*) [1207]
