kernel-packages team mailing list archive

[Bug 605773] Re: Wrong kernel setting zone_reclaim_mode leads to performance problems

 

Andras Fabian, this bug was reported a while ago and there hasn't been
any activity in it recently. We were wondering if this is still an
issue? If so, could you please run the following command in the
development release from a Terminal
(Applications->Accessories->Terminal), as it will automatically gather
and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

When reporting bugs in the future please use apport by using
'ubuntu-bug' and the name of the package affected. You can learn more
about this functionality at https://wiki.ubuntu.com/ReportingBugs.

Also, during your maintenance window, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc5

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's
Status as Confirmed. Please let us know your results. Thank you for your
understanding.

** Tags added: needs-kernel-logs needs-upstream-testing

** Changed in: linux (Ubuntu)
       Status: Confirmed => Incomplete

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/605773

Title:
  Wrong kernel setting zone_reclaim_mode leads to performance problems

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  Binary package hint: linux-image-server

  --------------------------------------------------
  Description:    Ubuntu 10.04 LTS
  Release:        10.04
  --------------------------------------------------
  linux-image-server version:
    Installed: 2.6.32.22.23
  --------------------------------------------------

  I discovered this problem after migrating a PostgreSQL database
  server from old hardware and an old OS to new hardware and a new OS.
  The transition itself went smoothly, but once the server was in
  production we noticed a strange problem during nightly backups: their
  runtime went up from 2 1/2 hours to 6 1/2 hours, even though the new
  hardware was designed to be much more powerful (which did show up in
  most other tasks).

  A lengthy investigation, drawing on the knowledge of many helpful
  people on the PostgreSQL mailing list, finally uncovered the reason
  for this slowdown. It turned out to be a problem in the VM part of
  the kernel. In situations where a lot of memory was consumed for
  caching (which easily happens while backing up 100 GByte DBs),
  congestion occurred in the VM and slowed the process down
  dramatically.

  After in-depth analysis of many parts (via the /proc file system, ps,
  strace etc.) and comparison with the settings on the old machines, I
  finally found one essential kernel setting, vm.zone_reclaim_mode,
  that was solely responsible for the issue. Luckily I could construct
  a simple test scenario (COPY-to-STDOUT: exporting the data from a
  database table via stdout and writing it through a pipe to the file
  system) in which I could reproduce the issue. Our new server had
  zone_reclaim_mode = 1 set, whereas our old servers used
  zone_reclaim_mode = 0. By switching this value back and forth (via
  sysctl), I could easily slow the experimental export process to a
  crawl, or let it run at full speed again.
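
  As a rough illustration of the check described above, the current
  setting can be inspected with a short script. This is a minimal
  sketch and not part of the original analysis: the function names
  read_zone_reclaim_mode and describe_mode are hypothetical, and the
  bit meanings are taken from the kernel's sysctl documentation for
  vm.zone_reclaim_mode.

```python
from pathlib import Path

# Location of the setting in procfs (present only on NUMA-enabled kernels).
ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"

# Bit flags as described in the kernel's vm sysctl documentation.
FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current value as an int, or None if the file is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except OSError:
        return None


def describe_mode(mode):
    """Translate the numeric mode into a list of human-readable flag names."""
    if mode == 0:
        return ["zone reclaim off"]
    return [label for bit, label in FLAGS.items() if mode & bit]


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("vm.zone_reclaim_mode is not available on this system")
    else:
        print(f"vm.zone_reclaim_mode = {mode}: {', '.join(describe_mode(mode))}")
```

  Changing the value at runtime is done as root, for example with
  "sysctl -w vm.zone_reclaim_mode=0"; the script above only reads it.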

  The complete path of the analysis can be viewed on the PostgreSQL mailing list
  (there is also a description of how the problem can be reproduced and what the many symptoms are):
  http://archives.postgresql.org/pgsql-general/2010-07/msg00267.php

  The conclusion to use "zone_reclaim_mode = 0" on our type of hardware was further strengthened by a very interesting thread on LKML, where the kernel developers discussed potential issues with this setting. You can read it here:
  http://lkml.org/lkml/2009/5/12/586

  That discussion boils down to the fact that, for reasons described
  there in detail, the Linux kernel thinks that modern CPU
  architectures (our new servers use Core i7 generation CPUs, which are
  explicitly mentioned!) are NUMA architectures. And for NUMA
  architectures it automatically enables "zone_reclaim_mode = 1", even
  though this is wrong for such machines and not even recommended under
  many circumstances. Interestingly, most posters in the LKML thread
  think that it would be better to always(!) default this value to
  "zone_reclaim_mode = 0" instead of relying on some automatic decision.

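  Whether the kernel treats a machine as NUMA can be checked by
  listing the nodes it exposes under sysfs. The following is a minimal
  sketch, assuming the usual /sys/devices/system/node/nodeN layout;
  the function name numa_node_ids is hypothetical. It only reports
  what the kernel exposes and does not reproduce the kernel's own
  decision logic for enabling zone_reclaim_mode.

```python
import re
from pathlib import Path


def numa_node_ids(root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found as nodeN directories under root."""
    base = Path(root)
    if not base.is_dir():
        # No node directory at all: kernel built without NUMA support,
        # or sysfs is not mounted.
        return []
    ids = []
    for entry in base.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    nodes = numa_node_ids()
    print(f"kernel reports {len(nodes)} NUMA node(s): {nodes}")
```

  More than one node in the output means the kernel sees the machine
  as NUMA, which is the situation the LKML thread discusses.
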
  Some more detail on what zone_reclaim_mode does can also be found here:
  http://www.linuxinsight.com/proc_sys_vm_zone_reclaim_mode.html

  Now, I don't know why this "defaulting to 0" is still not in the
  mainline kernels. That discussion from May 2009 on LKML died down,
  and obviously no one felt responsible for committing the patches
  (even though one of the participants had obviously already prepared
  some!). BUT, I would ask the Ubuntu team to maybe act on their own
  and provide a way to fix this issue in Ubuntu 10.04 LTS (because
  some reports on the net suggest that "zone_reclaim_mode = 1" can harm
  performance in many ways)! And I believe that I will not be the only
  PostgreSQL admin affected by this issue!

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/605773/+subscriptions