maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #12509
Re: 2f9bbf392072: MDEV-18734 optimization by taking ownership via String::swap()
Hi Sergei,
On Tue, Dec 22, 2020 at 7:58 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
> Hi, Aleksey!
>
> On Nov 03, Aleksey Midenkov wrote:
> >
> > the updated patches are here:
> >
> > https://github.com/MariaDB/server/commits/5f6bcd6cd0145513974b0dfc5b2cefba28b72f1a
>
> I started writing my comments there, but realized that it'd be clearer
> if I reply here, more context. That is, see below.
>
> One comment about the new test case, you have 16K lines in the result
> file, are all 16K necessary? Can you do something like
>
> select x, left(b,10), left(v,10)
>
> ?
Yes, these shorter result cases reproduce as well. Updated.
>
> And please, pretty please, don't use non-constant references as function
> arguments.
Why? With pointers we can pass NULL and we don't need NULL there, do we?
>
> > On Mon, Nov 2, 2020 at 11:33 PM Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
> > >
> > > > diff --git a/sql/field.cc b/sql/field.cc
> > > > index bdaaecc20269..0fd40c979d2c 100644
> > > > --- a/sql/field.cc
> > > > +++ b/sql/field.cc
> > > > @@ -8310,6 +8310,7 @@ int Field_blob::store(const char *from,uint length,CHARSET_INFO *cs)
> > > > copy_length= copier.well_formed_copy(field_charset,
> > > > (char*) value.ptr(), new_length,
> > > > cs, from, length);
> > > > + value.length(copy_length);
> > >
> > > good! Could this be a distinct bug with its own test case?
> >
> > Do we need to spend time on that bureaucracy? There are real problems to
> > solve... Tickets drain time from the management people and generally from
> > anyone who sees them.
>
> I didn't mean opening an MDEV. Only writing a test case and extracting
> it into a separate commit.
>
> If this is a genuine bug fix and you don't create a test case - how can
> you be sure the fix will stay? It may disappear in a merge or a rebase,
> manually or automatically. Or in a refactoring that "looks fine, all
> tests pass". If you can create a test case - please, do.
It is tested by federated.federated_partition test case. Why do we
have to extract this side-fix to a separate commit? Looks anyway as a
bureaucracy to me. The main bug is critical, the test case tests both
bugs nicely. If the partitioning code is refactored, the test case
still tests that side-fix.
>
> > > > Field_blob::store_length(copy_length);
> > > > bmove(ptr+packlength,(uchar*) &tmp,sizeof(char*));
> > > >
> > > > diff --git a/sql/ha_partition.cc b/sql/ha_partition.cc
> > > > index 9d82df0235c7..7c658082d397 100644
> > > > --- a/sql/ha_partition.cc
> > > > +++ b/sql/ha_partition.cc
> > > > @@ -5125,14 +5128,14 @@ bool ha_partition::init_record_priority_queue()
> > > > uint alloc_len;
> > > > uint used_parts= bitmap_bits_set(&m_part_info->read_partitions);
> > > > /* Allocate record buffer for each used partition. */
> > > > - m_priority_queue_rec_len= m_rec_length + PARTITION_BYTES_IN_POS;
> > > > + m_priority_queue_rec_len= m_rec_length + ORDERED_REC_OFFSET;
> > > > if (!m_using_extended_keys)
> > > > m_priority_queue_rec_len += m_file[0]->ref_length;
> > > > alloc_len= used_parts * m_priority_queue_rec_len;
> > > > /* Allocate a key for temporary use when setting up the scan. */
> > > > alloc_len+= table_share->max_key_length;
> > > >
> > > > - if (!(m_ordered_rec_buffer= (uchar*)my_malloc(alloc_len, MYF(MY_WME))))
> > > > + if (!(m_ordered_rec_buffer= (uchar*) alloc_root(&m_ordered_root, alloc_len)))
> > >
> > > I don't see why you need a memroot here. memroot is needed when you
> > > allocate later at difeerent points in time some initially unpredictable
> > > amount of memory that has the same lifetime and needs to be
> > > freed all at once.
> >
> > I know... This is just another convenient way of allocation with presumably
> > minimized number of mallocs (probably pre_alloc_size should be tweaked to a
> > better value).
>
> memroot wasn't created for that. There are approaches with much less
> overhead
Using memory pools like that is a nice design pattern for non-critical
sections. At least it is nicer than manual pointer management. I
believe I've seen the similar pattern in InnoDB with its memory pool
mem_heap_t even without preallocation! It would be nice to have a
special class that preallocates and manages pointers, but I do not see
the reason to develop that for this specific use case.
I don't believe in *much* overhead: mem_root is a simplest algorithm,
a couple of ifs and one loop. Initialization call
init_record_priority_queue() is not a critical section, the major
execution is mostly busy by a data copy in
handle_ordered_index_scan(), handle_ordered_next().
Let's see gprof stats collected for 100k SELECTs attached in stats.txt
init_record_priority_queue() takes 0.3%; handle_ordered_index_scan()
takes 2.3%; handle_ordered_next() takes 12.3%.
Let's see unpatched stats for the same 100k SELECTs in stats_vanilla.txt
init_record_priority_queue() takes 0.1%; handle_ordered_index_scan()
takes 2.2%; handle_ordered_next() takes 11.4%.
My version of init_record_priority_queue() lost 0.2%. From details we
see that mem_root took 6 out of 13 which is 0.15%.
One should keep in mind this is an artificial usage of the prio queue.
In diversified real-world usage the share of such kind SELECT is say
0.1%. But even if it is 1% we divide our stats by 100. So, do you
really care about 0.0015%?
>
> > >
> > > But here it seems that you only allocate once. You could just make
> > >
> > > m_priority_queue_rec_len = m_rec_length + PARTITION_BYTES_IN_POS +
> > > table->s->blob_fields * sizeof(Ordered_blob_storage)
> >
> > I don't wnat this data serialized. It is inconvenient to read the dump of
> > the record. Why we should make it harder to code, read the code and debug
> > while we can make it easier? If it is only for guaranteed 1 malloc, I don't
> > think it's worth it.
>
> What do you mean "serialized"?
A sort of memory organization when data is placed into a single
contiguous buffer.
> What do you mean "read the dump"?
x command in GDB.
>
> Note that you can also use my_multi_malloc, that allocates many buffers
> together and hides address arithmetcs from the caller. Like
>
> my_multi_malloc(MYF(MY_WME),
> &m_ordered_rec_buffer, m_rec_length + PARTITION_BYTES_IN_POS,
> &blob_storage, table->s->blob_fields * sizeof(Ordered_blob_storage),
> NULL)
>
> (but with error checking). And then, perhaps, with placement new[]
> you could construct all Ordered_blob_storage objects in one line.
Sorry, I don't understand how my version is much worse.
>
> > > > @@ -6178,7 +6203,11 @@ int ha_partition::handle_ordered_index_scan(uchar *buf, bool reverse_order)
> > > > */
> > > > error= file->read_range_first(m_start_key.key? &m_start_key: NULL,
> > > > end_range, eq_range, TRUE);
> > > > - memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > > > + if (!error)
> > > > + {
> > > > + memcpy(rec_buf_ptr, table->record[0], m_rec_length);
> > > > + }
> > >
> > > did you have a bug because of that? something didn't work?
> >
> > No, I just saw worthless memcpy and avoided it.
>
> You're right that memcopy is unnecessary in case of an error.
> But most of the time there is no error. You made an exceptional code path
> faster by making a common code path slower. I think it's generally
> better to do the opposite, making the common, no-error code path
> faster, fewer conditionals, in particular. At the cost of making the error
> path slower, if needed.
I'm usually aware of such things in my decisions. The slowdown is
close to 0, it is not even 0.0001%, it is much lower. OTOH memcpy()
slowdown is much higher and this is not a fully exceptional code path:
HA_ERR_KEY_NOT_FOUND continues the loop.
>
> > > > @@ -6310,6 +6349,43 @@ int ha_partition::handle_ordered_index_scan_key_not_found()
> > > > + if (cached)
> > > > + {
> > > > + cached->swap(s.blob);
> > > > + s.set_read_value= set_read_value;
> > >
> > > is it indeed possible here for a value to be either in `value` or in
> > > `read_value` ?
> > >
> > > When happens what?
> >
> > Yes, it uses `value` for plain fields and `read_value` for virtual fields.
>
> Oh, I see.
>
> This read_value was introduced in ea1b25046c81db8fdf to solve the case
> of UPDATE, where a row is moved with store_record(table, record[1]) and
> blob pointers aren't updated. This is very similar to what you're
> fixing, perhaps read_value should go away and both bugs should have a
> one unified solution that allows to "park" a record somewhere out of
> record[0] and restore later, all with proper blob handling?
Do you think this is necessary now? I believe there are a lot of
*much* more important tasks.
>
> > >
> > > > + }
> > > > + }
> > > > + }
> > > > + table->move_fields(table->field, table->record[0], rec_buf);
> > > > +}
>
> Regards,
> Sergei
> VP of MariaDB Server Engineering
> and security@xxxxxxxxxxx
--
All the best,
Aleksey Midenkov
@midenok
0.02 0.13 100000/100000 ha_partition::index_init(unsigned int, bool) <cycle 3> [262]
[289] 0.3 0.02 0.13 100000 ha_partition::init_record_priority_queue() [289]
0.00 0.04 800000/5602760 Sql_alloc::operator new(unsigned long, st_mem_root*) [176]
0.00 0.02 500000/15013727 alloc_root [109]
0.00 0.02 800000/800000 Ordered_blob_storage::Ordered_blob_storage() [883]
0.00 0.01 100000/100000 init_prealloc_root [1101]
0.00 0.01 100000/100003 init_queue [1120]
0.00 0.01 400000/10400882 bitmap_get_next_set [225]
0.00 0.00 100000/1700086 bitmap_get_first_set [691]
0.00 0.00 400000/218144857 _db_pargs_ [86]
0.00 0.00 100000/237505826 _db_return_ [42]
0.00 0.00 100000/237505806 _db_enter_ [43]
0.00 0.00 100000/200000 bitmap_bits_set [4743]
0.05 1.12 100000/100000 ha_partition::common_index_read(unsigned char*, bool) [82]
[87] 2.3 0.05 1.12 100000 ha_partition::handle_ordered_index_scan(unsigned char*, bool) [87]
0.02 1.04 400000/400000 handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool) [94]
0.01 0.02 300000/5800000 ha_partition::swap_blobs(unsigned char*, Ordered_blob_storage**, bool) [125]
0.00 0.01 100000/2900000 ha_partition::return_top_record(unsigned char*) [195]
0.00 0.01 399999/10400882 bitmap_get_next_set [225]
0.00 0.00 299999/5799999 ha_myisam::position(unsigned char const*) [456]
0.00 0.00 100000/100000 queue_fix [1668]
0.00 0.00 600000/218144857 _db_pargs_ [86]
0.00 0.00 100000/1700086 bitmap_get_first_set [691]
0.00 0.00 100000/237505826 _db_return_ [42]
0.00 0.00 100000/237505806 _db_enter_ [43]
0.00 0.00 100000/20700271 bitmap_is_set [766]
0.06 6.19 2899999/2899999 ha_partition::read_range_next() [26]
[27] 12.3 0.06 6.19 2899999 ha_partition::handle_ordered_next(unsigned char*, bool) [27]
0.10 5.26 2900000/2900000 handler::read_range_next() [36]
0.02 0.26 2800000/2900000 ha_partition::return_top_record(unsigned char*) [195]
0.07 0.17 2600000/5800000 ha_partition::swap_blobs(unsigned char*, Ordered_blob_storage**, bool) [125]
0.07 0.03 2600000/2700000 _downheap [397]
0.03 0.04 300000/300000 queue_remove [486]
0.02 0.03 2900000/237505826 _db_return_ [42]
0.02 0.03 2899999/237505806 _db_enter_ [43]
0.02 0.02 2600000/5799999 ha_myisam::position(unsigned char const*) [456]
0.01 0.01 2800000/218144857 _db_pargs_ [86]
0.01 0.04 100000/100000 ha_partition::index_init(unsigned int, bool) <cycle 3> [417]
[544] 0.1 0.01 0.04 100000 ha_partition::init_record_priority_queue() [544]
0.01 0.01 100000/100003 init_queue [853]
0.01 0.00 400000/10000879 bitmap_get_next_set [246]
0.00 0.01 100000/4010970 my_malloc [187]
0.00 0.00 400000/215344917 _db_pargs_ [84]
0.00 0.00 100000/236005936 _db_enter_ [42]
0.00 0.00 100000/236005953 _db_return_ [45]
0.00 0.00 100000/1600086 bitmap_get_first_set [847]
0.00 0.00 100000/200000 bitmap_bits_se
0.03 1.06 100000/100000 ha_partition::common_index_read(unsigned char*, bool) [90]
[92] 2.2 0.03 1.06 100000 ha_partition::handle_ordered_index_scan(unsigned char*, bool) [92]
0.00 1.04 400000/400000 handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool) [98]
0.01 0.00 400000/10000879 bitmap_get_next_set [246]
0.00 0.00 300000/5800000 ha_myisam::position(unsigned char const*) [416]
0.00 0.00 600000/215344917 _db_pargs_ [84]
0.00 0.00 100000/100000 queue_fix [1724]
0.00 0.00 100000/236005936 _db_enter_ [42]
0.00 0.00 100000/236005953 _db_return_ [45]
0.00 0.00 100000/1600086 bitmap_get_first_set [847]
0.00 0.00 100000/2900000 ha_partition::return_top_record(unsigned char*) [1207]
0.00 0.00 100000/9100265 bitmap_is_set [870]
0.13 5.46 2900000/2900000 ha_partition::read_range_next() [29]
[30] 11.4 0.13 5.46 2900000 ha_partition::handle_ordered_next(unsigned char*, bool) [30]
0.05 5.19 2900000/2900000 handler::read_range_next() [35]
0.03 0.03 2600000/2700000 _downheap [516]
0.02 0.03 2900000/236005936 _db_enter_ [42]
0.01 0.03 2900000/236005953 _db_return_ [45]
0.00 0.03 2600000/5800000 ha_myisam::position(unsigned char const*) [416]
0.01 0.01 300000/300000 queue_remove [834]
0.01 0.01 2800000/215344917 _db_pargs_ [84]
0.01 0.00 2800000/2900000 ha_partition::return_top_record(unsigned char*) [1207]
Follow ups
References