maria-developers team mailing list archive

Re: Next steps in improving single-threaded performance

To: Sergey Vojtovich <svoj@xxxxxxxxxxx>
From: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>
Date: Mon, 27 Jan 2014 10:14:07 +0100
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx, Axel Schwenke <axel@xxxxxxxxxxxxxxxx>
In-reply-to: <20140127084701.GB3439@june> (Sergey Vojtovich's message of "Mon, 27 Jan 2014 12:47:02 +0400")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

Sergey Vojtovich <svoj@xxxxxxxxxxx> writes:

> just out of curiosity: is it possible to find out which functions cause highest
> amount of icache misses?

Yes, see the second post, the profiles marked "Icache misses (ICACHE.MISSES),
before PGO" and "Icache misses (ICACHE.MISSES), after PGO". These are level 1
cache misses.

You will see that the functions with high cache miss rate are more or less the
same as the functions that execute a lot of instructions. Note however that
according to Intel documentation, there is a large skid on those events, so
one should not rely too much on the precise location reported.

>  Can it have anything to do with branch misprediction?

If you look at the same post, you will see profiles for
BR_MISP_RETIRED.ALL_BRANCHES_PS. This is a precise event, so it points
directly to the instruction after the mispredicted branch. We do get 12% or so
fewer mispredictions, so it has some effect. In comparison, we get 23% fewer
icache misses.

Note that the main sources of branch misprediction are calls to frequently used
shared library functions (due to the indirect jump in the PLT) and virtual function
calls. This suggests that the problem here is that the sheer number of
branches executed causes eviction of otherwise correctly predicted branches.
We are simply executing too much code per request for the CPU to handle
efficiently, a common thing in server applications.
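
As an illustration of the two kinds of indirect branch mentioned above, here is
a minimal C++ sketch. It is not taken from the server source: the Handler
classes and the call_indirect() helper are hypothetical names. A virtual call
goes through the vtable, and a call through a function pointer resembles the
indirect jump a PLT stub performs, although an optimising compiler may
devirtualise or inline either one when it can see the target.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical class hierarchy, for illustration only.
struct Handler {
    virtual ~Handler() = default;
    virtual int rows() const = 0;
};

struct HandlerA : Handler {
    int rows() const override { return 1; }
};

struct HandlerB : Handler {
    int rows() const override { return 2; }
};

// The target of h.rows() depends on the dynamic type of h, so this is
// normally compiled to an indirect call through the vtable.
int count_rows(const Handler &h) { return h.rows(); }

// A call through a function pointer is likewise an indirect branch. A PLT
// stub does something similar: it jumps to an address loaded from a table
// that the dynamic linker fills in.
using LenFn = std::size_t (*)(const char *);

std::size_t call_indirect(LenFn fn, const char *s) { return fn(s); }
```

When many such call sites execute per request, each one competes for space in
the branch predictor's tables, which is the eviction effect described above.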

Another improvement that I noticed is in make_join_statistics(). With PGO, the
compiler emits calls to optimised memset() and memcpy() functions for large
structure memory writes, instead of byte-by-byte "rep movsb" sequences.
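
The kind of source code affected looks roughly like the sketch below. The
BigStats structure and the two helpers are hypothetical stand-ins, not the real
types used in make_join_statistics(). How a particular compiler lowers these
operations (an inline store loop, a "rep" string instruction, or a library
call) depends on the compiler version, flags and profile data, so the generated
assembly should be inspected (for example with -S) rather than assumed.

```cpp
#include <cstring>

// Hypothetical stand-in for a large per-table structure.
struct BigStats {
    double cost[256];
    unsigned long rows[256];
};

// Clearing a large structure: the compiler may expand this inline or
// emit a call to the library memset().
void reset_stats(BigStats *s) { std::memset(s, 0, sizeof(*s)); }

// Structure assignment copies every byte of the object: the compiler may
// expand this inline or emit a call to the library memcpy().
void copy_stats(BigStats *dst, const BigStats *src) { *dst = *src; }
```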

There are probably many small improvements spread out over the code that
contribute to the overall speedup; it is hard to pin them down precisely in such
a large code base. The reason I mention icache misses in particular is that:

1. The performance counter measurements pre-PGO clearly show that icache
   miss rate is the main bottleneck in the CPU.

2. PGO is well suited to reducing icache misses.

3. Indeed, measurements post-PGO show a significant reduction in icache
   misses.

 - Kristian.

Follow ups

Re: Next steps in improving single-threaded performance
From: Sergey Vojtovich, 2014-01-28

References

Next steps in improving single-threaded performance
From: Kristian Nielsen, 2014-01-24
Re: Next steps in improving single-threaded performance
From: Sergey Vojtovich, 2014-01-27