maria-developers team mailing list archive

Re: Next steps in improving single-threaded performance


Hi Kristian,

yes, the second post answers most of my questions. Somehow I missed it, sorry.

Still, some questions, mostly to educate myself. According to /proc, the mysqld
executable size is something like:
VmExe:	   12228 kB
VmLib:	    6272 kB

I assume the above refers to the overall amount of instructions. The level 1
instruction cache size is around 32 KB, right?
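
To put those numbers side by side, here is a minimal sketch in C. The helper names status_kb() and code_to_l1i_ratio() are made up for illustration, and the 32 KB L1 instruction cache is the figure I am assuming above, not something I measured. Note that VmExe and VmLib report the size of the mapped code, not the amount of code actually executed per request, so the ratio is only an upper bound on the pressure.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the kB value from one line of
 * /proc/<pid>/status, e.g. "VmExe:\t   12228 kB".
 * Returns -1 if the line does not start with "<key>:" or has no number. */
long status_kb(const char *line, const char *key)
{
    size_t klen = strlen(key);
    long kb;

    if (strncmp(line, key, klen) != 0 || line[klen] != ':')
        return -1;
    /* %ld skips the tab and spaces that precede the number. */
    if (sscanf(line + klen + 1, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Hypothetical helper: how many times larger the mapped code
 * (executable plus shared libraries) is than an L1 instruction
 * cache of l1i_kb kilobytes (assumed value, e.g. 32). */
double code_to_l1i_ratio(long exe_kb, long lib_kb, long l1i_kb)
{
    return (double)(exe_kb + lib_kb) / (double)l1i_kb;
}
```

Feeding the two lines above through status_kb() and passing the results to code_to_l1i_ratio() with l1i_kb = 32 gives the ratio of mapped code to L1 instruction cache under that assumption.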

When you say that we're executing too much code per request, did you mean the

Do you think we can get a similar speedup by adding compiler hints (e.g.
likely/unlikely) and code optimizations?
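
For reference, this is the kind of hint I have in mind: a minimal sketch assuming GCC or Clang, where likely/unlikely are usually defined on top of __builtin_expect(). The function sum_checked() is a made-up example, not code from the server. The hint only tells the compiler which side of the branch to lay out as the hot path; it does not change what the code computes. As far as I understand, PGO derives the same kind of information from the recorded profile instead of from manual annotations.

```c
#include <stddef.h>

/* Branch hints: __builtin_expect() is a GCC/Clang builtin. On other
 * compilers the macros fall back to the plain condition. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical example: sum an array of values that are expected to be
 * non-negative. The error branch is marked unlikely so the compiler can
 * keep it out of the hot loop body.
 * Returns the sum and sets *err to 0, or returns -1 and sets *err to 1. */
long sum_checked(const int *vals, size_t n, int *err)
{
    long total = 0;
    size_t i;

    *err = 0;
    for (i = 0; i < n; i++) {
        if (unlikely(vals[i] < 0)) {
            *err = 1;      /* rare path */
            return -1;
        }
        total += vals[i];  /* common path */
    }
    return total;
}
```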


On Mon, Jan 27, 2014 at 10:14:07AM +0100, Kristian Nielsen wrote:
> Sergey Vojtovich <svoj@xxxxxxxxxxx> writes:
> > just out of curiosity: is it possible to find out which functions cause the highest
> > number of icache misses?
> Yes, see the second post, the profiles marked "Icache misses (ICACHE.MISSES),
> before PGO" and "Icache misses (ICACHE.MISSES), after PGO". These are level 1
> cache misses.
> You will see that the functions with high cache miss rate are more or less the
> same as the functions that execute a lot of instructions. Note however that
> according to Intel documentation, there is a large skid on those events, so
> one should not rely too much on the precise location reported.
> >  Can it have anything to do with branch misprediction?
> If you look at the same post, you will see profiles for
> BR_MISP_RETIRED.ALL_BRANCHES_PS. This is a precise event, so it points
> directly to the instruction after the mispredicted branch. We do get 12% or so
> fewer mispredictions, so it has some effect. In comparison, we get 23% fewer
> icache misses.
> Note that the main source of branch misprediction is frequently called shared
> library functions (due to the indirect jump in PLT), and virtual function
> calls. This suggests that the problem here is that the sheer number of
> branches executed causes eviction of otherwise correctly predicted branches.
> We are simply executing too much code per request for the CPU to handle
> efficiently, a common thing in server applications.
> Another improvement that I noticed is in make_join_statistics(). PGO uses
> calls to optimised memset() and memcpy() functions for large structure memory
> writes, instead of byte-by-byte "rep movsb" sequences.
> There are probably many small improvements that contribute to the overall
> speedup spread out over the code; it is hard to determine precisely with such
> a large code base. The reason I mention icache misses in particular is that
> 1. The performance counter measurements pre-PGO clearly show that icache
>    miss rate is the main bottleneck in the CPU.
> 2. PGO is well suited to reducing icache misses.
> 3. Indeed, measurements post-PGO show a significant reduction in icache
>    misses.
>  - Kristian.
