[Bug 1663280] Re: Serious performance degradation of math functions
This bug was fixed in the package glibc - 2.23-0ubuntu11
---------------
glibc (2.23-0ubuntu11) xenial; urgency=medium
* debian/patches/ubuntu/xsave-part1.diff and
debian/patches/ubuntu/xsave-part2.diff: Fix a serious performance
regression when mixing SSE and AVX code on certain processors.
The patches are from the upstream 2.23 stable branch. (LP: #1663280)
-- Daniel Axtens <daniel.axtens@xxxxxxxxxxxxx>  Thu, 04 Oct 2018 10:29:55 +1000
** Changed in: glibc (Ubuntu Xenial)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1663280
Title:
Serious performance degradation of math functions
Status in GLibC:
Fix Released
Status in glibc package in Ubuntu:
Fix Released
Status in glibc source package in Xenial:
Fix Released
Status in glibc source package in Zesty:
Won't Fix
Status in glibc package in Fedora:
Fix Released
Bug description:
SRU Justification
=================
[Impact]
* Severe performance hit on many maths-heavy workloads. For example,
a user reports linpack performance of 13 Gflops on Trusty and Bionic
and 3.9 Gflops on Xenial.
* Because the impact is so large (>3x) and Xenial is supported until
2021, the fix should be backported.
* The fix avoids an AVX-SSE transition penalty. It stops
_dl_runtime_resolve() from using AVX-256 instructions which touch the
upper halves of various registers. This change means that the
processor does not need to save and restore them.
[Test Case]
Firstly, you need a suitable Intel machine. Users report that Sandy
Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected, and I
have been able to reproduce it on a Skylake CPU using a suitable Azure
VM.
Create the following C file, exp.c:
#include <math.h>
#include <stdio.h>

int main() {
    double a, b;
    for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
    printf("%f\n", a);
    return 0;
}
$ gcc -O3 -march=x86-64 -o exp exp.c -lm
With the current version of glibc:
$ time ./exp
...
real 0m1.349s
user 0m1.349s
$ time LD_BIND_NOW=1 ./exp
...
real 0m0.625s
user 0m0.621s
Observe that LD_BIND_NOW makes a big difference as it avoids the call
to _dl_runtime_resolve.
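As a side note, the lazy resolution itself can be observed with glibc's
LD_DEBUG facility (a standard loader feature; the exact output format
varies between glibc versions):

$ LD_DEBUG=bindings ./exp 2>&1 | grep symbol

Without LD_BIND_NOW, the binding of exp to libm is performed (and logged)
on its first call; with LD_BIND_NOW=1 all bindings are performed up
front, before main() runs.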
With the proposed update:
$ time ./exp
...
real 0m0.625s
user 0m0.621s
$ time LD_BIND_NOW=1 ./exp
...
real 0m0.631s
user 0m0.631s
Observe that the normal case is faster, and LD_BIND_NOW makes a
negligible difference.
[Regression Potential]
glibc is the nightmare case for regressions, as a bad update could affect
pretty much anything, and this patch touches a key component (the dynamic
linker).
We can be fairly confident in the fix itself: it is already in the glibc
shipped by Bionic, Debian and some RPM-based distros. The backport is
based on the patches in the release/2.23/master branch of the upstream
glibc repository, and the backport was straightforward.
Obviously that doesn't remove all risk. There is also a fair bit of
Ubuntu-specific patching in glibc, so other distros are of limited value
for ruling out bugs. I have therefore done the following testing, and
I'm happy to do more as required. All testing has been done:
- on an Azure VM (affected by the change), with proposed package
- on a local VM (not affected by the change), with proposed package
* Boot with the upgraded libc6.
* Watch a youtube video in Firefox over VNC.
* Build some C code (debuild of zlib).
* Test Java by installing and running Eclipse.
Autopkgtest also passes.
[Original Description]
The bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25
[2]. All Ubuntu versions starting from 16.04 are affected because they
use either Glibc 2.23 or 2.24. The bug introduces a serious (2x-4x)
performance degradation of the math functions (pow, exp/exp2/exp10,
log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2,
sinh/cosh/tanh, asinh/acosh/atanh) provided by libm. It can be
reproduced on any AVX-capable x86-64 machine.
@strikov: According to a quite reliable source [5], all AMD CPUs and the
latest Intel CPUs (Skylake and Knights Landing) don't suffer from the
AVX/SSE transition penalty. This narrows the scope of the bug to the
following generations of Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell,
and Broadwell. The scope still remains quite large, though.
@strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the fix
from the upstream 2.24 branch (as Marcel pointed out, the fix has been
backported to the 2.24 branch, from which Fedora took it successfully)
if such a synchronization takes place. Ubuntu 16.04 (the main target of
this bug) uses Glibc 2.23, which hasn't been patched upstream and will
suffer from the performance degradation until we fix it manually.
This bug is all about the AVX-SSE transition penalty [3]. The 256-bit
YMM registers used by AVX-256 instructions extend the 128-bit registers
used by SSE (XMM0 is the low half of YMM0, and so on). Every time the
CPU executes an SSE instruction after an AVX-256 instruction, it has to
store the upper halves of the YMM registers to an internal buffer and
restore them when execution returns to AVX instructions. The
store/restore is required because old-fashioned SSE knows nothing about
the upper halves of its registers and may damage them. This
store/restore operation is time consuming (several tens of clock cycles
for each operation).

To deal with this issue, Intel introduced AVX-128 instructions, which
operate on the same 128-bit XMM registers as SSE but take the upper
halves of the YMM registers into account. Hence, no store/restore is
required. Practically speaking, AVX-128 instructions are a new, smarter
form of SSE instructions which can be used together with full-size
AVX-256 instructions without any penalty, and Intel recommends using
AVX-128 instructions instead of SSE instructions wherever possible.

To sum things up, it's okay to mix SSE with AVX-128, and AVX-128 with
AVX-256. Mixing AVX-128 with AVX-256 is allowed because both types of
instructions are aware of the 256-bit YMM registers. Mixing SSE with
AVX-128 is okay because the CPU can guarantee that the upper halves of
the YMM registers contain no meaningful data (how could one put it there
without using AVX-256 instructions?) and skip the store/restore
operation (why care about random trash in the upper halves?). It's not
okay to mix SSE with AVX-256 because of the transition penalty. Note
that the scalar floating-point instructions used by the routines
mentioned above are implemented as a subset of the SSE and AVX-128
instruction sets: they operate on a small fraction of a 128-bit register
but are still considered SSE or AVX-128 instructions, so they suffer
from the SSE/AVX transition penalty as well.
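These rules can be demonstrated instruction by instruction. The sketch
below (purely illustrative, assuming GCC extended asm and an AVX-capable
x86-64 CPU of the affected generations) annotates where the
store/restore penalty is paid:

#include <stdio.h>

int main() {
    float v[8] = {0};
    __asm__ volatile (
        "vmovups %0, %%ymm0\n\t"             /* AVX-256: upper halves become 'dirty' */
        "addps %%xmm1, %%xmm1\n\t"           /* SSE after AVX-256: penalty (store uppers) */
        "vaddps %%ymm0, %%ymm0, %%ymm0\n\t"  /* AVX-256 after SSE: penalty (restore uppers) */
        "vaddps %%xmm1, %%xmm1, %%xmm1\n\t"  /* AVX-128 after AVX-256: no penalty */
        "vzeroupper\n\t"                     /* declares the upper halves clean */
        "addps %%xmm1, %%xmm1\n\t"           /* SSE after vzeroupper: no penalty */
        :: "m"(*v) : "xmm0", "xmm1");
    puts("done");
    return 0;
}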
Glibc inadvertently triggers a chain of AVX/SSE transition penalties due
to inappropriate use of AVX-256 instructions inside the
_dl_runtime_resolve() procedure. By using AVX-256 instructions to
push/pop the YMM registers [4], Glibc makes the CPU think that the upper
halves of the XMM registers contain meaningful data which needs to be
preserved during execution of SSE instructions. With this 'dirty' flag
set, every switch between SSE and AVX instructions (AVX-128 or AVX-256)
leads to a time-consuming store/restore procedure. The 'dirty' flag
never gets cleared during the whole program execution, which leads to a
serious overall slowdown. The fixed implementation [2] of
_dl_runtime_resolve() avoids using AVX-256 instructions where possible.
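For reference, the fixed implementation (and the xsave-part1/2 backports
above) saves the vector state with fxsave/xsave instead of explicit
AVX-256 moves. Below is a rough sketch of the idea in C, using the real
_xsave/_xrstor intrinsics but a hypothetical helper name; the actual
resolver is hand-written assembly in dl-trampoline.h [4]:

#include <immintrin.h>

/* Hypothetical stand-in for the symbol lookup and GOT patching that the
   real resolver performs. */
extern void *resolve_symbol_and_patch_got(void);

/* XSAVE stores whatever x87/SSE/AVX state exists without executing any
   AVX-256 instruction, so the resolver no longer marks the YMM upper
   halves 'dirty'. Illustrative only (a static buffer is not reentrant).
   Build with: gcc -mxsave -c resolver-sketch.c */
void *resolver_sketch(void) {
    static unsigned char state[4096] __attribute__((aligned(64)));
    _xsave(state, ~0ULL);    /* save the caller's register state */
    void *target = resolve_symbol_and_patch_got();
    _xrstor(state, ~0ULL);   /* restore it on the way out */
    return target;
}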
The buggy _dl_runtime_resolve() gets called every time the dynamic
linker resolves a symbol (any symbol, not just the ones mentioned
above). It's enough for _dl_runtime_resolve() to be called just once to
touch the upper halves of the YMM registers and provoke the AVX/SSE
transition penalty from then on. It's safe to say that every dynamically
linked application calls _dl_runtime_resolve() at least once, which
means that all of them may experience this slowdown. The degradation
shows up when such an application mixes AVX and SSE instructions
(switches from AVX to SSE or back).
There are two types of math routines provided by libm:
(a) ones that have an AVX-optimized version (exp, sin/cos, tan, atan, log, and others)
(b) ones that don't have an AVX-optimized version and rely on the generic SSE implementation (pow, exp2/exp10, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh, and others)
For the former group of routines, the slowdown happens when they are
called from SSE code (i.e. from an application compiled with -mno-avx),
because an SSE -> AVX transition takes place. For the latter group, the
slowdown happens when they are called from AVX code (i.e. from an
application compiled with -mavx), because an AVX -> SSE transition takes
place. Both situations are realistic: gcc generates SSE code when
targeting generic x86-64, and it generates AVX-optimized code with
-march=native on AVX-capable machines.
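A quick way to confirm which flavour gcc emits is to look at the
generated assembly: v-prefixed mnemonics (vaddsd, vmulsd, ...) are AVX
encodings, while the unprefixed ones (addsd, mulsd) are SSE. A sketch,
assuming a GNU toolchain and the exp.c file from the test case above:

$ gcc -O3 -march=x86-64 -S -o exp-sse.s exp.c        # accumulation compiles to addsd (SSE)
$ gcc -O3 -march=x86-64 -mavx -S -o exp-avx.s exp.c  # accumulation compiles to vaddsd (AVX-128)
$ grep -w addsd exp-sse.s
$ grep -w vaddsd exp-avx.s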
============================================================================
Let's take one routine from group (a) and try to reproduce the slowdown.
#include <math.h>
#include <stdio.h>

int main() {
    double a, b;
    for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
    printf("%f\n", a);
    return 0;
}
$ gcc -O3 -march=x86-64 -o exp exp.c -lm
$ time ./exp
<..> 2.801s <..>
$ time LD_BIND_NOW=1 ./exp
<..> 0.660s <..>
You can see that the application demonstrates 4x better performance when
_dl_runtime_resolve() doesn't get called. That's how serious the impact
of the AVX/SSE transition penalty can be.
============================================================================
Let's take one routine from group (b) and try to reproduce the slowdown.
#include <math.h>
#include <stdio.h>

int main() {
    double a, b;
    for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
    printf("%f\n", a);
    return 0;
}
# note that the -mavx option is passed
$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
$ time ./pow
<..> 4.157s <..>
$ time LD_BIND_NOW=1 ./pow
<..> 2.123s <..>
You can see that the application demonstrates 2x better performance when
_dl_runtime_resolve() doesn't get called.
============================================================================
[!] It's important to mention that the scope of this bug might be even
wider. After a call to the buggy _dl_runtime_resolve(), any transition
between AVX-128 and SSE (otherwise legitimate) will suffer from the
penalty. Any application which mixes AVX-128 floating-point code with
SSE floating-point code (e.g. by using an external SSE-only library)
will experience a serious slowdown. A self-contained way to exercise
this case is sketched below.
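The sketch is illustrative only and the function names are made up. The
file is compiled without -mavx, so sse_library_fn() stands in for an
external SSE-only library, while the target("avx") attribute makes
avx_caller() use AVX-128 encodings; the inline vmovups stands in for the
single AVX-256 touch that the buggy _dl_runtime_resolve() provides in
practice:

#include <stdio.h>

/* Stand-in for an external SSE-only library routine (built without AVX). */
__attribute__((noinline))
static double sse_library_fn(double x) { return x * 1.0000001; }  /* mulsd: SSE */

/* 'Application' code compiled for AVX: scalar FP uses AVX-128 encodings. */
__attribute__((target("avx"), noinline))
static double avx_caller(double x) { return sse_library_fn(x) + 1.0; }  /* vaddsd */

int main() {
    /* One AVX-256 instruction marks the upper halves 'dirty', just as a
       call through the buggy _dl_runtime_resolve() would. */
    float v[8] = {0};
    __asm__ volatile ("vmovups %0, %%ymm0" :: "m"(*v) : "xmm0");

    double a = 0.0;
    for (long i = 0; i < 50000000; i++)
        a = avx_caller(a);  /* AVX-128 <-> SSE transition on every call */
    printf("%f\n", a);
    return 0;
}

$ gcc -O2 -o mix mix.c

On the affected CPUs this should run markedly slower than the same
binary with the vmovups line removed.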
[0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
[4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l182
[5] http://www.agner.org/optimize/blog/read.php?i=761#761
To manage notifications about this bug go to:
https://bugs.launchpad.net/glibc/+bug/1663280/+subscriptions