maria-developers team mailing list archive

Thread
Date
Analysing MariaDB 5.5 sysbench performance regression

To: maria-developers@xxxxxxxxxxxxxxxxxxx
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Tue, 20 Mar 2012 11:36:48 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
I have been analysing the performance difference between MariaDB 5.5.21 and
MySQL 5.5.21 using oltp.lua in sysbench 0.5. It turns out that MySQL has about
8% higher tps in this benchmark when running CPU bound. (See below for links to
blog posts that started this).

I took the base sysbench run and tweaked it to get stable numbers - basically
put the datadir in ramdisk and adjusted my.cnf a bit (more details
below). Then I used OProfile to analyse where the difference comes from.

Oprofile is a really great tool for this kind of analysis, where CPU time
spent is the measurement. It is very easy to use, and it gives detailed and
accurate measurements on how much time is spent in each function. It is then
easy to compare this against MySQL and find where we have performance
regressions. Then further details is available by adding call graph to the
reports, which shows from where the expensive parts are called; and by using
the annotation facility which annotates every function and puts execution
times on the individual statement and assembler functions.

Here is an overview of where the extra time is spent. Time is in units of
samples, which is 1/100000 of a clock cycle. Total was around 4.15M
samples. Here are the top functions I found used more time in MariaDB, and the
amount of extra samples spent in MariaDB:

    39k   mysql_select
    20k   alloc_root
    16k   JOIN::choose_subquery_plan
    10k   thd_ha_data
    10k   make_sortkey()
     9k   get_best_combination
     9k   handler::read_range_next
     7k   handler::ha_thd
     7k   update_ref_and_keys
     7k   Item::cmp_type
     6k   Item::operator delete
     5k   ha_maria::implicit_commit
     4k   end_write
     4k   sort_and_filter_keyuse
     4k   my_interval_timer
     4k   st_select_lex::master_unit
     4k   st_select_lex::save_leaf_tables
     4k   sel_arg_range_seq_next
     4k   thd_test_options
     4k   Item_func::walk
     4k   TABLE::init
     4k   sub_select
     3k   strnmov
     3k   ut_free
     3k   vio_write
     3k   ut_malloc_low
     2k   setup_tables
     2k   create_ref_for_key

Some care is needed to interpret these data. GCC -O3 does heavy
inlining. Oprofile is a statistical profiler, so numbers are somewhat
uncertain, more so for smaller numbers. And modern CPUs are superscalar,
out-of-order, and hyper-threaded, which makes interpreting individual
instruction costs higly challenging.

But the list can be used to drill down into the actual code; opannotate
--source --assembler is very enlightening, in my experience (few examples
below).

>From this list, it seems that - not surprising - the 5.3 optimiser features
seem to have added a bit of overhead. Much of this is spread out over many
places in the code, though it could be that the root cause comes from few
inlined functions.

I checked mysql_select(), and most of the overhead is from lp:941889,
expensive constructors in JOIN, which was inlined by GCC. I believe Sergey
plans to fix this. More analysis of the individual functions is needed to find
the cause there. It seems we are also using alloc_root() somewhat more.

thd_ha_data and maybe handler_thd are likely due to using dynamically loaded
innodb_plugin rather than built-in.

I would like to do more analysis (and would probably need help from
optimiser-knowledgable people for most of it) - I think it is feasible to
remove most of the overhead in a modest time frame.

But my main point is to present these results, to show that we have a very
useful tool to pin-point these performance issues, and to start discussion in
more detail (and I have plenty more of those).

-----------------------------------------------------------------------

More background:

These benchmarks started from these blog posts:

    http://www.mysqlperformanceblog.com/2012/02/18/mariadb-5-3-4-benchmarks/
    http://blog.montyprogram.com/benchmarking-mariadb-5-3-4/

The main difference in these tests seem to be caused by different XtraDB
buffer pool flush algorithm. So Axel tried using innodb_pluing in MariaDB
instead, and got much smaller differences - but a difference was still there.

So for 5.5 I took Axel's benchmark, adjusted until I got stable, fully
cpu-bound numbers, and analysed the results, as above. I have the actual
scripts and my.cnf available if anyone is interested. Main point is datadir in
ramdisk, reducing redo log size to 300MB total, and running for 30 seconds.

As examples, I analysed a handful of cases, and came up with two simple
patches for specific issues.

One is update_virtual_fields(), which originally accounted for 0.5% of total
time in sysbench. This seems to be called for every row, so just the push/pop
of registers in function prologue became significant. I fixed this with below
patch; it puts a NULL pointer into the table instead of an empty list of
virtual fields, and guards a few frequent calls with an if (NULL) check.

Another is my_hash_sort_simple(). This is not a regression, but Oprofile shows
we spend 10% of total time in this function.

opannotate of the assembler showed the problem. Below simple patch speeds it
up by 33%, by reducing 4 memory loads per loop to 2. Still, it would be nice
to understand why this is called that much - here is the call stack:

    my_hash_sort_simple()
    ha_heap::write_row().
    evaluate_join_record()
    sub_select()
    do_select()
    JOIN::exec()
    mysql_select()

It would be nice to understand why this is needed - this is an expensive
function, it needs to lookup every character in the collation tables. If this
could be eliminated, that alone would probably be enough to win over MySQL,
even without fixing any of the regressions.

Some other top contenders in the profile are innodb functions. For example
buf_calc_page_new_checksum(). This is expensive, as it needs to loop over
every byte in each buffer pool page (16K), computing a 7-long dependency chain
for each byte. This could be much faster if the checksum was done on 8-byte
longwords, but AFAIK, it's not trivial to change the checksum algorithm due to
the migration issues. Facebook may have done some work on this.

Another top contender is row_sel_store_mysql_rec() and similar - they show
high costs in the opannotate on loads from the records, this is probably cache
misses from accessing pages in the buffer pool - not too much to do about that
I think.

Hope this helps,

 - Kristian.

-----------------------------------------------------------------------
Patch for my_hash_sort_simple():

=== modified file 'strings/ctype-simple.c'
--- strings/ctype-simple.c	2012-01-13 14:50:02 +0000
+++ strings/ctype-simple.c	2012-03-16 16:34:31 +0000
@@ -306,19 +306,24 @@ void my_hash_sort_simple(CHARSET_INFO *c
 {
   register const uchar *sort_order=cs->sort_order;
   const uchar *end;
+  ulong n1, n2;
 
   /*
     Remove end space. We have to do this to be able to compare
     'A ' and 'A' as identical
   */
   end= skip_trailing_space(key, len);
-  
+
+  n1= *nr1;
+  n2= *nr2;
   for (; key < (uchar*) end ; key++)
   {
-    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
-	     ((uint) sort_order[(uint) *key])) + (nr1[0] << 8);
-    nr2[0]+=3;
+    n1^=(ulong) ((((uint) n1 & 63)+n2) *
+	     ((uint) sort_order[(uint) *key])) + (n1 << 8);
+    n2+=3;
   }
+  *nr1= n1;
+  *nr2= n2;
 }
 
 
-----------------------------------------------------------------------
Patch for update_virtual_fields():

=== modified file 'sql/filesort.cc'
--- sql/filesort.cc	2012-03-11 22:45:18 +0000
+++ sql/filesort.cc	2012-03-16 19:17:58 +0000
@@ -589,7 +589,7 @@ static ha_rows find_all_keys(SORTPARAM *
     {
       if ((error= select->quick->get_next()))
         break;
-      if (!error)
+      if (!error && sort_form->vfield)
         update_virtual_fields(thd, sort_form);
       file->position(sort_form->record[0]);
       DBUG_EXECUTE_IF("debug_filesort", dbug_print_record(sort_form, TRUE););
@@ -608,7 +608,7 @@ static ha_rows find_all_keys(SORTPARAM *
       else
       {
 	error=file->ha_rnd_next(sort_form->record[0]);
-	if (!error)
+	if (!error && sort_form->vfield)
 	  update_virtual_fields(thd, sort_form);
 	if (!flag)
 	{

=== modified file 'sql/records.cc'
--- sql/records.cc	2012-01-13 14:50:02 +0000
+++ sql/records.cc	2012-03-16 19:15:51 +0000
@@ -344,7 +344,8 @@ static int rr_quick(READ_RECORD *info)
       break;
     }
   }
-  update_virtual_fields(info->thd, info->table);
+  if (info->table->vfield)
+    update_virtual_fields(info->thd, info->table);
   return tmp;
 }
 

=== modified file 'sql/sql_select.cc'
--- sql/sql_select.cc	2012-03-11 22:45:18 +0000
+++ sql/sql_select.cc	2012-03-16 19:15:05 +0000
@@ -15824,7 +15824,8 @@ evaluate_join_record(JOIN *join, JOIN_TA
     DBUG_RETURN(NESTED_LOOP_KILLED);            /* purecov: inspected */
   }
 
-  update_virtual_fields(join->thd, join_tab->table);
+  if (join_tab->table->vfield)
+    update_virtual_fields(join->thd, join_tab->table);
 
   if (select_cond)
   {

=== modified file 'sql/table.cc'
--- sql/table.cc	2012-03-02 23:03:20 +0000
+++ sql/table.cc	2012-03-18 16:10:18 +0000
@@ -2466,30 +2466,35 @@ int open_table_from_share(THD *thd, TABL
   /*
     Process virtual columns, if any.
   */
-  if (!(vfield_ptr = (Field **) alloc_root(&outparam->mem_root,
-                                          (uint) ((share->vfields+1)*
-                                                  sizeof(Field*)))))
-    goto err;
-
-  outparam->vfield= vfield_ptr;
-  
-  for (field_ptr= outparam->field; *field_ptr; field_ptr++)
+  if (!share->vfields)
+    outparam->vfield= NULL;
+  else
   {
-    if ((*field_ptr)->vcol_info)
+    if (!(vfield_ptr = (Field **) alloc_root(&outparam->mem_root,
+                                             (uint) ((share->vfields+1)*
+                                                     sizeof(Field*)))))
+      goto err;
+
+    outparam->vfield= vfield_ptr;
+
+    for (field_ptr= outparam->field; *field_ptr; field_ptr++)
     {
-      if (unpack_vcol_info_from_frm(thd,
-                                    outparam,
-                                    *field_ptr,
-                                    &(*field_ptr)->vcol_info->expr_str,
-                                    &error_reported))
+      if ((*field_ptr)->vcol_info)
       {
-        error= 4; // in case no error is reported
-        goto err;
+        if (unpack_vcol_info_from_frm(thd,
+                                      outparam,
+                                      *field_ptr,
+                                      &(*field_ptr)->vcol_info->expr_str,
+                                      &error_reported))
+        {
+          error= 4; // in case no error is reported
+          goto err;
+        }
+        *(vfield_ptr++)= *field_ptr;
       }
-      *(vfield_ptr++)= *field_ptr;
     }
+    *vfield_ptr= 0;                              // End marker
   }
-  *vfield_ptr= 0;                              // End marker
 
 #ifdef WITH_PARTITION_STORAGE_ENGINE
   if (share->partition_info_str_len && outparam->file)

-----------------------------------------------------------------------
Follow ups

Re: Analysing MariaDB 5.5 sysbench performance regression
From: Axel Schwenke, 2012-03-31
Re: Analysing MariaDB 5.5 sysbench performance regression
From: MARK CALLAGHAN, 2012-03-21
Re: Analysing MariaDB 5.5 sysbench performance regression
From: Kristian Nielsen, 2012-03-20
Re: Analysing MariaDB 5.5 sysbench performance regression
From: Vladislav Vaintroub, 2012-03-20