maria-developers team mailing list archive

Thread
Date
Re: Analysing MariaDB 5.5 sysbench performance regression

To: "'Kristian Nielsen'" <knielsen@xxxxxxxxxxxxxxx>, <maria-developers@xxxxxxxxxxxxxxxxxxx>
From: "Vladislav Vaintroub" <wlad@xxxxxxxxxxxxxxxx>
Date: Tue, 20 Mar 2012 14:22:55 +0100
In-reply-to: <878vivsgn3.fsf@frigg.knielsen-hq.org>
Thread-index: AQFL9BnpaZ22isDWrdJf6z0uKrdFBpd1RC8A
Coincidentally, I'm running the some performance comparisons of 5.2 vs 5.3
with a sampling profiler (it is not an oprofile, and is not perf, and I do
not think anyone but me will run it)
I'm running the simplest possible benchmark here - sysbench 0.5 using
prepared SELECT 1 statement, 64 users. Its performance is 80% compared to
5.2 baseline.
The things I see here that can be simply avoided

1. Initialization of large arrays of objects (setting up virtual function
pointers) : lp:941889, expensive constructors in JOIN, which was inlined by
GCC.
This one is not as conclusive as I hoped it would be. On some machines it
brings as much as 15%  , on others merely 1%.

2. Once JOIN  constructor is fixed to avoid virtual objects here, comparing
the profile shows differences in heap allocations. I think alloc_root in
Kristians profile has to do with it, too. Inclusive sample is grown by a
high number, however exclusive samples are almost all in RtlAllocateHeap and
friends (yes, I'm profiling on Windows). The callstacks point to the same
JOIN allocator.

My conclusion:
JOIN has simply grown too large, to the point where malloc()s do not
optimize anymore. Heap allocators (whether Windows or not) optimize well for
small sizes, and no more for huge things like JOIN.
To prove the theory I changed the #define MAX_TABLES in unireg.h to 10
instead of 64, which

has simply grown too large in 5.3.  As I see it is also the main reason why
Kristian sees grown alloc_root proportion.
I can "fix" it by defining MAX_TABLES in unireg.h to 1 instead of 61

3. Seemingly easy thing:  This is not 5.3 specific, but seems to be strange
to me anyway - on each select, even on "SELECT 1" 8KB of memory is allocated
and freed, in sql_base.cc , in open_tables()
init_sql_alloc(&new_frm_mem, 8024, 8024);

If someone knows why it is necessary, I'd like to hear an opinion. I can
change it to init_sql_alloc(&new_frm_mem, 8024, 0); to avoid allocations.


With all optimizations (and a hack of redefining MAX_TABLES to 10), I get
the kind of performance we have had in 5.2, with 1% less throughput on my
"SELECT 1" test, not 25% as original.
Other things I noticed, that are not so much performance relevant, but still
worth noting- time measuring -  THD::set_time, QueryPerformanceCounter can
be seen in 5.3 profile (not a big deal, 2%). They were not in 5.2.

Wlad

> -----Original Message-----
> From: maria-developers-
> bounces+wlad=montyprogram.com@xxxxxxxxxxxxxxxxxxx [mailto:maria-
> developers-bounces+wlad=montyprogram.com@xxxxxxxxxxxxxxxxxxx] On
> Behalf Of Kristian Nielsen
> Sent: Dienstag, 20. März 2012 11:37
> To: maria-developers@xxxxxxxxxxxxxxxxxxx
> Subject: [Maria-developers] Analysing MariaDB 5.5 sysbench performance
> regression
> 
> I have been analysing the performance difference between MariaDB 5.5.21
> and
> MySQL 5.5.21 using oltp.lua in sysbench 0.5. It turns out that MySQL has
> about
> 8% higher tps in this benchmark when running CPU bound. (See below for
> links to
> blog posts that started this).
> 
> I took the base sysbench run and tweaked it to get stable numbers -
basically
> put the datadir in ramdisk and adjusted my.cnf a bit (more details
> below). Then I used OProfile to analyse where the difference comes from.
> 
> Oprofile is a really great tool for this kind of analysis, where CPU time
> spent is the measurement. It is very easy to use, and it gives detailed
and
> accurate measurements on how much time is spent in each function. It is
> then
> easy to compare this against MySQL and find where we have performance
> regressions. Then further details is available by adding call graph to the
> reports, which shows from where the expensive parts are called; and by
> using
> the annotation facility which annotates every function and puts execution
> times on the individual statement and assembler functions.
> 
> Here is an overview of where the extra time is spent. Time is in units of
> samples, which is 1/100000 of a clock cycle. Total was around 4.15M
> samples. Here are the top functions I found used more time in MariaDB, and
> the
> amount of extra samples spent in MariaDB:
> 
>     39k   mysql_select
>     20k   alloc_root
>     16k   JOIN::choose_subquery_plan
>     10k   thd_ha_data
>     10k   make_sortkey()
>      9k   get_best_combination
>      9k   handler::read_range_next
>      7k   handler::ha_thd
>      7k   update_ref_and_keys
>      7k   Item::cmp_type
>      6k   Item::operator delete
>      5k   ha_maria::implicit_commit
>      4k   end_write
>      4k   sort_and_filter_keyuse
>      4k   my_interval_timer
>      4k   st_select_lex::master_unit
>      4k   st_select_lex::save_leaf_tables
>      4k   sel_arg_range_seq_next
>      4k   thd_test_options
>      4k   Item_func::walk
>      4k   TABLE::init
>      4k   sub_select
>      3k   strnmov
>      3k   ut_free
>      3k   vio_write
>      3k   ut_malloc_low
>      2k   setup_tables
>      2k   create_ref_for_key
> 
> Some care is needed to interpret these data. GCC -O3 does heavy
> inlining. Oprofile is a statistical profiler, so numbers are somewhat
> uncertain, more so for smaller numbers. And modern CPUs are superscalar,
> out-of-order, and hyper-threaded, which makes interpreting individual
> instruction costs higly challenging.
> 
> But the list can be used to drill down into the actual code; opannotate
> --source --assembler is very enlightening, in my experience (few examples
> below).
> 
> >From this list, it seems that - not surprising - the 5.3 optimiser
features
> seem to have added a bit of overhead. Much of this is spread out over many
> places in the code, though it could be that the root cause comes from few
> inlined functions.
> 
> I checked mysql_select(), and most of the overhead is from lp:941889,
> expensive constructors in JOIN, which was inlined by GCC. I believe Sergey
> plans to fix this. More analysis of the individual functions is needed to
find
> the cause there. It seems we are also using alloc_root() somewhat more.
> 
> thd_ha_data and maybe handler_thd are likely due to using dynamically
> loaded
> innodb_plugin rather than built-in.
> 
> I would like to do more analysis (and would probably need help from
> optimiser-knowledgable people for most of it) - I think it is feasible to
> remove most of the overhead in a modest time frame.
> 
> But my main point is to present these results, to show that we have a very
> useful tool to pin-point these performance issues, and to start discussion
in
> more detail (and I have plenty more of those).
> 
> -----------------------------------------------------------------------
> 
> More background:
> 
> These benchmarks started from these blog posts:
> 
>     http://www.mysqlperformanceblog.com/2012/02/18/mariadb-5-3-4-
> benchmarks/
>     http://blog.montyprogram.com/benchmarking-mariadb-5-3-4/
> 
> The main difference in these tests seem to be caused by different XtraDB
> buffer pool flush algorithm. So Axel tried using innodb_pluing in MariaDB
> instead, and got much smaller differences - but a difference was still
there.
> 
> So for 5.5 I took Axel's benchmark, adjusted until I got stable, fully
> cpu-bound numbers, and analysed the results, as above. I have the actual
> scripts and my.cnf available if anyone is interested. Main point is
datadir in
> ramdisk, reducing redo log size to 300MB total, and running for 30
seconds.
> 
> As examples, I analysed a handful of cases, and came up with two simple
> patches for specific issues.
> 
> One is update_virtual_fields(), which originally accounted for 0.5% of
total
> time in sysbench. This seems to be called for every row, so just the
push/pop
> of registers in function prologue became significant. I fixed this with
below
> patch; it puts a NULL pointer into the table instead of an empty list of
> virtual fields, and guards a few frequent calls with an if (NULL) check.
> 
> Another is my_hash_sort_simple(). This is not a regression, but Oprofile
> shows
> we spend 10% of total time in this function.
> 
> opannotate of the assembler showed the problem. Below simple patch
> speeds it
> up by 33%, by reducing 4 memory loads per loop to 2. Still, it would be
nice
> to understand why this is called that much - here is the call stack:
> 
>     my_hash_sort_simple()
>     ha_heap::write_row().
>     evaluate_join_record()
>     sub_select()
>     do_select()
>     JOIN::exec()
>     mysql_select()
> 
> It would be nice to understand why this is needed - this is an expensive
> function, it needs to lookup every character in the collation tables. If
this
> could be eliminated, that alone would probably be enough to win over
> MySQL,
> even without fixing any of the regressions.
> 
> Some other top contenders in the profile are innodb functions. For example
> buf_calc_page_new_checksum(). This is expensive, as it needs to loop over
> every byte in each buffer pool page (16K), computing a 7-long dependency
> chain
> for each byte. This could be much faster if the checksum was done on
8-byte
> longwords, but AFAIK, it's not trivial to change the checksum algorithm
due
> to
> the migration issues. Facebook may have done some work on this.
> 
> Another top contender is row_sel_store_mysql_rec() and similar - they show
> high costs in the opannotate on loads from the records, this is probably
cache
> misses from accessing pages in the buffer pool - not too much to do about
> that
> I think.
> 
> Hope this helps,
> 
>  - Kristian.
> 
> -----------------------------------------------------------------------
> Patch for my_hash_sort_simple():
> 
> === modified file 'strings/ctype-simple.c'
> --- strings/ctype-simple.c	2012-01-13 14:50:02 +0000
> +++ strings/ctype-simple.c	2012-03-16 16:34:31 +0000
> @@ -306,19 +306,24 @@ void my_hash_sort_simple(CHARSET_INFO *c
>  {
>    register const uchar *sort_order=cs->sort_order;
>    const uchar *end;
> +  ulong n1, n2;
> 
>    /*
>      Remove end space. We have to do this to be able to compare
>      'A ' and 'A' as identical
>    */
>    end= skip_trailing_space(key, len);
> -
> +
> +  n1= *nr1;
> +  n2= *nr2;
>    for (; key < (uchar*) end ; key++)
>    {
> -    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
> -	     ((uint) sort_order[(uint) *key])) + (nr1[0] << 8);
> -    nr2[0]+=3;
> +    n1^=(ulong) ((((uint) n1 & 63)+n2) *
> +	     ((uint) sort_order[(uint) *key])) + (n1 << 8);
> +    n2+=3;
>    }
> +  *nr1= n1;
> +  *nr2= n2;
>  }
> 
> 
> -----------------------------------------------------------------------
> Patch for update_virtual_fields():
> 
> === modified file 'sql/filesort.cc'
> --- sql/filesort.cc	2012-03-11 22:45:18 +0000
> +++ sql/filesort.cc	2012-03-16 19:17:58 +0000
> @@ -589,7 +589,7 @@ static ha_rows find_all_keys(SORTPARAM *
>      {
>        if ((error= select->quick->get_next()))
>          break;
> -      if (!error)
> +      if (!error && sort_form->vfield)
>          update_virtual_fields(thd, sort_form);
>        file->position(sort_form->record[0]);
>        DBUG_EXECUTE_IF("debug_filesort", dbug_print_record(sort_form,
> TRUE););
> @@ -608,7 +608,7 @@ static ha_rows find_all_keys(SORTPARAM *
>        else
>        {
>  	error=file->ha_rnd_next(sort_form->record[0]);
> -	if (!error)
> +	if (!error && sort_form->vfield)
>  	  update_virtual_fields(thd, sort_form);
>  	if (!flag)
>  	{
> 
> === modified file 'sql/records.cc'
> --- sql/records.cc	2012-01-13 14:50:02 +0000
> +++ sql/records.cc	2012-03-16 19:15:51 +0000
> @@ -344,7 +344,8 @@ static int rr_quick(READ_RECORD *info)
>        break;
>      }
>    }
> -  update_virtual_fields(info->thd, info->table);
> +  if (info->table->vfield)
> +    update_virtual_fields(info->thd, info->table);
>    return tmp;
>  }
> 
> 
> === modified file 'sql/sql_select.cc'
> --- sql/sql_select.cc	2012-03-11 22:45:18 +0000
> +++ sql/sql_select.cc	2012-03-16 19:15:05 +0000
> @@ -15824,7 +15824,8 @@ evaluate_join_record(JOIN *join, JOIN_TA
>      DBUG_RETURN(NESTED_LOOP_KILLED);            /* purecov: inspected */
>    }
> 
> -  update_virtual_fields(join->thd, join_tab->table);
> +  if (join_tab->table->vfield)
> +    update_virtual_fields(join->thd, join_tab->table);
> 
>    if (select_cond)
>    {
> 
> === modified file 'sql/table.cc'
> --- sql/table.cc	2012-03-02 23:03:20 +0000
> +++ sql/table.cc	2012-03-18 16:10:18 +0000
> @@ -2466,30 +2466,35 @@ int open_table_from_share(THD *thd, TABL
>    /*
>      Process virtual columns, if any.
>    */
> -  if (!(vfield_ptr = (Field **) alloc_root(&outparam->mem_root,
> -                                          (uint) ((share->vfields+1)*
> -                                                  sizeof(Field*)))))
> -    goto err;
> -
> -  outparam->vfield= vfield_ptr;
> -
> -  for (field_ptr= outparam->field; *field_ptr; field_ptr++)
> +  if (!share->vfields)
> +    outparam->vfield= NULL;
> +  else
>    {
> -    if ((*field_ptr)->vcol_info)
> +    if (!(vfield_ptr = (Field **) alloc_root(&outparam->mem_root,
> +                                             (uint) ((share->vfields+1)*
> +                                                     sizeof(Field*)))))
> +      goto err;
> +
> +    outparam->vfield= vfield_ptr;
> +
> +    for (field_ptr= outparam->field; *field_ptr; field_ptr++)
>      {
> -      if (unpack_vcol_info_from_frm(thd,
> -                                    outparam,
> -                                    *field_ptr,
> -                                    &(*field_ptr)->vcol_info->expr_str,
> -                                    &error_reported))
> +      if ((*field_ptr)->vcol_info)
>        {
> -        error= 4; // in case no error is reported
> -        goto err;
> +        if (unpack_vcol_info_from_frm(thd,
> +                                      outparam,
> +                                      *field_ptr,
> +                                      &(*field_ptr)->vcol_info->expr_str,
> +                                      &error_reported))
> +        {
> +          error= 4; // in case no error is reported
> +          goto err;
> +        }
> +        *(vfield_ptr++)= *field_ptr;
>        }
> -      *(vfield_ptr++)= *field_ptr;
>      }
> +    *vfield_ptr= 0;                              // End marker
>    }
> -  *vfield_ptr= 0;                              // End marker
> 
>  #ifdef WITH_PARTITION_STORAGE_ENGINE
>    if (share->partition_info_str_len && outparam->file)
> 
> -----------------------------------------------------------------------
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~maria-developers
> Post to     : maria-developers@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~maria-developers
> More help   : https://help.launchpad.net/ListHelp
References

Analysing MariaDB 5.5 sysbench performance regression
From: Kristian Nielsen, 2012-03-20