kernel-packages team mailing list archive

[Bug 1470404] Re: Some workloads experience more measurement variation with scaling_governor=performance than ondemand

 

This bug was fixed in the package linux - 3.16.0-44.59

---------------
linux (3.16.0-44.59) utopic; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1472030

  [ Iyappan Subramanian ]

  * SAUCE: (no-up) drivers: net: xgene: fix: Out of order descriptor bytes
    read
    - LP: #1425576

  [ Upstream Kernel Changes ]

  * Revert "tools/vm: fix page-flags build"
    - LP: #1471170
  * NVMe: Add shutdown timeout as module parameter.
    - LP: #1465136
  * Drivers: hv: vmbus: Add support for VMBus panic notifier handler
    - LP: #1463584
  * Drivers: hv: vmbus: Correcting truncation error for constant
    HV_CRASH_CTL_CRASH_NOTIFY
    - LP: #1463584
  * KVM: nVMX: fix lifetime issues for vmcs02
    - LP: #1448269
  * KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
    - LP: #1448269
  * mm/slab_common: support the slub_debug boot option on specific object
    size
    - LP: #1456952
  * kvm: x86: fix kvm_apic_has_events to check for NULL pointer
  * cpuidle: powernv: Populate cpuidle state details by querying the
    device-tree
    - LP: #1470404
  * cpuidle: powernv: Read target_residency value of idle states from DT if
    available
    - LP: #1470404
  * cpuidle: powernv: Avoid endianness conversions while parsing DT
    - LP: #1470404
  * cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
    - LP: #1470404
  * iio: adis16400: Report pressure channel scale
    - LP: #1471170
  * iio: adis16400: Use != channel indices for the two voltage channels
    - LP: #1471170
  * iio: adis16400: Compute the scan mask from channel indices
    - LP: #1471170
  * iio: adis16400: Remove unused variable
    - LP: #1471170
  * iio: adis16400: Fix burst mode
    - LP: #1471170
  * iio: adis16400: Fix burst transfer for adis16448
    - LP: #1471170
  * USB: serial: ftdi_sio: Add support for a Motion Tracker Development
    Board
    - LP: #1471170
  * iio: adc: twl6030-gpadc: Fix modalias
    - LP: #1471170
  * serial: imx: Fix DMA handling for IDLE condition aborts
    - LP: #1471170
  * usb: dwc3: gadget: Fix incorrect DEPCMD and DGCMD status macros
    - LP: #1471170
  * ALSA: usb-audio: Add mic volume fix quirk for Logitech Quickcam Fusion
    - LP: #1471170
  * n_tty: Fix auditing support for cannonical mode
    - LP: #1471170
  * drm/i915/hsw: Fix workaround for server AUX channel clock divisor
    - LP: #1471170
  * x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
    - LP: #1471170
  * lib: Fix strnlen_user() to not touch memory after specified maximum
    - LP: #1471170
  * Input: elantech - fix detection of touchpads where the revision matches
    a known rate
    - LP: #1471170
  * ALSA: hda/realtek - Add a fixup for another Acer Aspire 9420
    - LP: #1471170
  * ALSA: usb-audio: add MAYA44 USB+ mixer control names
    - LP: #1471170
  * ALSA: usb-audio: fix missing input volume controls in MAYA44 USB(+)
    - LP: #1471170
  * USB: cp210x: add ID for HubZ dual ZigBee and Z-Wave dongle
    - LP: #1471170
  * Input: elantech - add new icbody type
    - LP: #1471170
  * MIPS: Fix enabling of DEBUG_STACKOVERFLOW
    - LP: #1471170
  * xfrm: fix a race in xfrm_state_lookup_byspi
    - LP: #1471170
  * kconfig: Fix warning "‘jump’ may be used uninitialized"
    - LP: #1471170
  * scripts/sortextable: suppress warning: `relocs_size' may be used
    uninitialized
    - LP: #1471170
  * thermal: step_wise: Revert optimization
    - LP: #1471170
  * MIPS: KVM: Do not sign extend on unsigned MMIO load
    - LP: #1471170
  * arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
    - LP: #1471170
  * net: core: Correct an over-stringent device loop detection.
    - LP: #1471170
  * net: phy: Allow EEE for all RGMII variants
    - LP: #1471170
  * net: dp83640: fix broken calibration routine.
    - LP: #1471170
  * net: dp83640: reinforce locking rules.
    - LP: #1471170
  * unix/caif: sk_socket can disappear when state is unlocked
    - LP: #1471170
  * xen/netback: Properly initialize credit_bytes
    - LP: #1471170
  * udp: fix behavior of wrong checksums
    - LP: #1471170
  * xen: netback: read hotplug script once at start of day.
    - LP: #1471170
  * ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()
    - LP: #1471170
  * bridge: disable softirqs around br_fdb_update to avoid lockup
    - LP: #1471170
  * drm/i915: Assume dual channel LVDS if pixel clock necessitates it
    - LP: #1471170
  * Btrfs: send, add missing check for dead clone root
    - LP: #1471170
  * Btrfs: send, don't leave without decrementing clone root's
    send_progress
    - LP: #1471170
  * btrfs: incorrect handling for fiemap_fill_next_extent return
    - LP: #1471170
  * btrfs: cleanup orphans while looking up default subvolume
    - LP: #1471170
  * iommu/vt-d: Allow RMRR on graphics devices too
    - LP: #1471170
  * iommu/vt-d: Fix passthrough mode with translation-disabled devices
    - LP: #1471170
  * ata: ahci_mvebu: Fix wrongly set base address for the MBus window
    setting
    - LP: #1471170
  * virtio_pci: Clear stale cpumask when setting irq affinity
    - LP: #1471170
  * irqchip: sunxi-nmi: Fix off-by-one error in irq iterator
    - LP: #1471170
  * pata_octeon_cf: fix broken build
    - LP: #1471170
  * Input: synaptics - add min/max quirk for Lenovo S540
    - LP: #1471170
  * drm/i915: Fix DDC probe for passive adapters
    - LP: #1471170
  * cfg80211: wext: clear sinfo struct before calling driver
    - LP: #1471170
  * mm/memory_hotplug.c: set zone->wait_table to null after freeing it
    - LP: #1471170
  * ring-buffer-benchmark: Fix the wrong sched_priority of producer
    - LP: #1471170
  * block: fix ext_dev_lock lockdep report
    - LP: #1471170
  * iser-target: Fix variable-length response error completion
    - LP: #1471170
  * iser-target: release stale iser connections
    - LP: #1471170
  * ALSA: hda - adding a DAC/pin preference map for a HP Envy TS machine
    - LP: #1471170
  * drm/mgag200: Reject non-character-cell-aligned mode widths
    - LP: #1471170
  * crypto: caam - fix uninitialized state->buf_dma field
    - LP: #1471170
  * crypto: caam - improve initalization for context state saves
    - LP: #1471170
  * crypto: caam - fix RNG buffer cache alignment
    - LP: #1471170
  * tracing: Have filter check for balanced ops
    - LP: #1471170
  * drm/radeon: fix freeze for laptop with Turks/Thames GPU.
    - LP: #1471170
  * Linux 3.16.7-ckt14
    - LP: #1471170

 -- Brad Figg <brad.figg@xxxxxxxxxxxxx>  Mon, 06 Jul 2015 17:48:28 -0700

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1470404

Title:
  Some workloads experience more measurement variation with
  scaling_governor=performance than ondemand

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Utopic:
  Fix Released
Status in linux source package in Vivid:
  Fix Released

Bug description:
  SRU Justification:
  [Impact]
  Certain workloads can exhibit a large variance in behavior due to how CPUs are idled on POWER8 systems.

  [Fix]

  For 3.16:
  74aa51b5ccd3975392e30d11820dc073c5f2cd32
  92c83ff5b42b109c94fdeee53cb31f674f776d75
  70734a786acfd1998e47d40df19cba5c29469bdf

  For 3.16, 3.19:
  78eaa10f027cf69f9bd409e64eaff902172b2327

  $ git describe 78eaa10f027cf69f9bd409e64eaff902172b2327
  v4.1-rc2-9-g78eaa10
  Once we rebase to something v4.1+ we'll have this fixed in Wily.

  [Test Case]
  Set the system with the SMT8 mode and scaling_governor=performance or ondemand.
  Run the workload 100 times.

  --

  == Comment: #0 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-15 21:30:31 ==
  ---Problem Description---
  Many workloads experience wide measurement variation, more with scaling_governor=performance than ondemand.

  Contact Information = wpeter@xxxxxxxxxx, farid@xxxxxxxxxx

  ---uname output---
  Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

  Machine Type = 20-core and 24-core Tuleta systems

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
  Set the system with the SMT8 mode and scaling_governor=performance or ondemand.
  Run the workload 100 times.
  Get 100 data points and sort them.
  Compare the spread of results with two governor modes.
  The source and scripts to run a simple test case will be provided.
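
As a rough sketch of the comparison these steps describe (sort the data points, then compare the spread per governor), something like the following could be used. Python is used only for illustration; the helper name `spread_summary` and the sample values are hypothetical and are not measurements from this bug.

```python
import statistics

def spread_summary(samples):
    """Return min, median, max and relative spread of a list of timings."""
    ordered = sorted(samples)
    lo, hi = ordered[0], ordered[-1]
    return {
        "min": lo,
        "median": statistics.median(ordered),
        "max": hi,
        # Spread relative to the fastest run; 0.0 means no variation at all.
        "rel_spread": (hi - lo) / lo,
    }

# Hypothetical elapsed times in seconds, one list per governor setting.
performance = [3.89, 3.88, 4.51, 3.89, 10.40]
ondemand = [3.93, 3.95, 3.94, 3.99, 3.93]
for name, data in (("performance", performance), ("ondemand", ondemand)):
    print(name, spread_summary(data))
```

With the real data, each list would hold the 100 elapsed times collected for one governor setting.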

  Stack trace output:
   no

  Oops output:
   no

  Userspace tool common name: not sure what it is.

  Userspace rpm: ??

  The userspace tool has the following bit modes: These are 64-bit
  programs.

  System Dump Info:
    The system is not configured to capture a system dump.

  Userspace tool obtained from project website:  na

  *Additional Instructions for wpeter@xxxxxxxxxx, farid@xxxxxxxxxx:
  -Attach sysctl -a output to the bug.
  -Attach ltrace and strace of userspace application.

  == Comment: #2 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:47:41 ==
  This problem has a number of variables we were trying to reduce:
  - endianness
  - operating system
  - kernel level
  - compiler

  Bob Walkup says he's seen the variability in a bunch of CPU-intensive
  test cases, in various languages, using various compilers, which would
  seem to eliminate the "compiler" variable.

  We had not looked at the performance governor setting to this point.
  Interesting results, and yet another variable to add to the above mix.
  Perhaps two more runs?  (LE-ondemand, LE-performance, BE-ondemand,
  BE-performance)

  == Comment: #3 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:50:09 ==
  Also, Bob says he can reproduce this with and without vectorization (the stalls move from the VSU to the FPU), and with and without floating point (the stalls move from the FPU to the FXU).  Very odd.

  == Comment: #4 - Andrea M. Davis <amdavis@xxxxxxxxxx> - 2015-04-16 10:10:01 ==
  Peter, what version of Ubuntu are you running?

  == Comment: #5 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:32:58 ==
  Andrea,

  Ubuntu 14.04.2 LTS.

  #uname -a
  Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

  #lsb_release -a
  No LSB modules are available.
  Distributor ID:	Ubuntu
  Description:	Ubuntu 14.04.2 LTS
  Release:	14.04
  Codename:	trusty

  == Comment: #6 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:50:11 ==
  There are a few more things we have tried.

  (1) For STREAM, it was originally compiled with gfortran and its
  corresponding OpenMP. I compiled it with xlf and its corresponding
  OpenMP. There is no difference in performance.

  (2) There was a concern about NUMA, meaning is it possible the CPU
  binding by OpenMP is incorrect so that there are remote memory
  accesses behind the scene? By disabling one DCM and using 10 or 12
  cores only in the other DCM, we can still see occasional drops in
  performance, although not often. We can conclude it is not due to
  NUMA.

  (3) Farid and I also tried out different scheduler parameters
  (sched_min_granularity_ns, sched_wakeup_granularity_ns,
  sched_latency_ns, and others) and matched them to the other
  distro's corresponding values, but did not see performance changes.

  (4) For the workload AMG2006, the use of scaling_governor=ondemand
  also reduces the spread of data significantly.

  (5) For all the above investigations, I used a 20-core Tuleta and a
  24-core Tuleta, although they are configured identically with Ubuntu
  14.04.2. I mean two systems paint a consistent picture.

  So far, we looked at compiler, NUMA, scheduler, memory test, CPU test,
  ST vs SMT, etc. There is a significant difference in variation between
  scaling_governor=performance and scaling_governor=ondemand with the
  same application and system configurations.

  Hopefully, the data point us to the right direction, i.e., there could
  be some unexpected behaviour with the implementation of
  scaling_governor=performance.

  == Comment: #7 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 14:30:21 ==
  Note that Bob Walkup does not see the improvement using scaling_governor=ondemand on a borrowed POK lab system. However, he still suggested that I open a bug based on my findings. I guess he is not totally sure about the system he got.

  It would be good to have data independently collected by others to
  verify my observations.

  Bob's serial_loop.c program can be compiled and run very easily. The
  examination of data is straightforward too.

  == Comment: #10 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-17 16:33:38 ==
  I was able to reproduce the problem with the serial_loop test described in comment 1 (my system is Ubuntu 15.04); however, disabling the nap cpuidle state seemed to resolve the variance:

  cpupower idle-set -d 0

  Can others reproduce?   I am not sure why nap behavior would be any
  different w/ the performance governor though..   Note, to re-enable:
  cpupower idle-set -E

  == Comment: #11 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 13:09:34 ==
  (In reply to comment #10)
  > disabling the nap cpuidle
  > state seemed to resolve the variance:
  >
  > cpupower idle-set -d 0

  just want to clarify state0 is actually snooze, not nap:
  # cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
  snooze
  # cat /sys/devices/system/cpu/cpu0/cpuidle/state1/name
  Nap
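
The two `cat` commands above can be generalized to list every idle state of a CPU. A minimal sketch, assuming the standard cpuidle sysfs layout shown in this comment; the function name `cpuidle_state_names` and its `sysfs_root` parameter are hypothetical and exist only so the lookup can be pointed at a different directory:

```python
from pathlib import Path

def cpuidle_state_names(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Map cpuidle state index -> state name for one CPU, read from sysfs."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        # No cpuidle support exposed for this CPU (or not a Linux system).
        return {}
    names = {}
    for state_dir in base.glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        names[index] = (state_dir / "name").read_text().strip()
    return dict(sorted(names.items()))

for index, name in cpuidle_state_names().items():
    print(f"state{index}: {name}")
```

Checking the names this way before running `cpupower idle-set -d N` avoids the snooze/nap mix-up corrected in this comment.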

  == Comment: #12 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-20 16:26:32 ==
  Jenifer, thanks for the suggestion.

  "cpupower idle-set -d 0" works for Bob's serial_loop.c program.

  There are 24 identical processes running serial_loop in parallel, each
  bound to one core. With 100 iterations, there are 2400 elapsed times
  collected for each run. Each elapsed time over 5 seconds is counted as
  an outlier.

  The following data were collected on a 24-core Tuleta system.

  scaling_governor = P(erformance) or O(ndemand)
  snooze (state0) = default (enabled) or disabled

  P and default         = 34-35 outliers
  P and snooze disabled = 0 outliers

  O and default         = 2-4 outliers
  O and snooze disabled = 0 outliers
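
The counting rule used for these numbers (any elapsed time over 5 seconds is an outlier) is simple enough to sketch. The function name `count_outliers` and the sample values are hypothetical; only the 5-second threshold comes from this comment.

```python
def count_outliers(elapsed_times, threshold_s=5.0):
    """Count runs whose elapsed time is strictly above the threshold."""
    return sum(1 for t in elapsed_times if t > threshold_s)

# Hypothetical elapsed times in seconds, not data from this bug.
sample = [3.76, 3.75, 6.52, 3.76, 9.01]
print(count_outliers(sample), "outliers out of", len(sample))
```

For the runs described here, the input would be the 2400 elapsed times (24 processes times 100 iterations).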

  As you asked, why do we need to disable snooze in order to reduce
  measurement variation when scaling_governor=performance?

  == Comment: #13 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 16:40:46 ==
  Vaidy,  could your team comment on this?  In SMT8 mode, more measurement variation is seen using the performance governor compared to the ondemand governor when snooze is enabled, but disabling snooze seems to resolve the problem. Does it make sense that snooze impacts would be higher in performance mode?

  Stewart mentioned some latency improvements in the new 830 OPAL
  firmware, is that related to this type of sleep state wakeup?

  == Comment: #14 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 12:23:01 ==
  "cpupower idle-set -d 0" also fixes the measurement variation of STREAM on a 24-core Tuleta system.

  scaling_governor=performance and default snooze = 65 outliers out of
  400 runs.

  scaling_governor=performance and snooze disabled = 0 outliers out of
  400 runs.

  == Comment: #15 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 23:21:22 ==
  "cpupower idle-set -d 0" also fixes the measurement variation of AMG2006 on a 24-core Tuleta system.

  It means when scaling_governor=performance, disabling snooze (state0,
  shallow sleep) while still enabling Nap (state1, deep sleep) can
  stabilize measurements.

  Vaidy,  please help understand this behaviour.

  == Comment: #17 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 14:22:11 ==
  Hi Team,

  Interesting observation.  Let me give possible contributing factors:

  (a) When running on ondemand, cpu frequency changed from min to max including turbo frequencies.
  (b) When running performance governor, frequency is set to constantly run turbo.

  Based on temperature, CPU may not be able to sustain turbo since we
  are constantly running at the frequency and burning more power.  The
  variation could actually come from the fact that the platform (OCC)
  could drop the frequency periodically due to over temperature.

  While running ondemand, turning down the power could help sustain the turbo frequency longer.
  Disabling snooze will further increase the power consumption and push for more variation at turbo frequency.

  Our systems are designed to run consistently at nominal frequency and
  hence I would suggest that you run your experiment by setting nominal
  frequency to all cores using performance governor+max limit or
  userspace governor.

  You could use "Throughput-performance profile" using tuned-adm for
  this purpose.

  If running in "Nominal" Frequency gives you consistent performance,
  then the above theory of turbo mode variation holds good.  We can
  confirm them with additional traces in cpufreq back-end driver code.
  We are currently improving our instrumentation to detect frequency
  variation and throttling.  This is a good scenario to validate our
  trace design as well.

  Let me know what you find.

  --Vaidy

  == Comment: #18 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-22 14:28:15 ==
  (In reply to comment #17)

  > Disabling snooze will further increase the power consumption and push for
  > more variation at turbo frequency.

  We actually see the opposite effect, disabling snooze makes the
  variability at turbo freq go away :)

  == Comment: #19 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 14:44:43 ==
  Additionally, this is not a problem when running BE kernel, on the same P8 configuration box. I suspect
  it is more to do with configuration settings on LE before we start pointing finger at the FW codepath
  when using Ubuntu LE.

  == Comment: #20 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-22 15:23:43 ==
  Bob is finding another distro LE does _not_ exhibit variation.

  This would seem to eliminate LE as the culprit.

  Looking at the settings of
  /sys/devices/system/cpu/cpu*/cpuidle/state0/disable, they all report
  "0", which I believe is the same as having "snooze" enabled, correct?
  That would seem to eliminate "snooze" in and of itself as a culprit,
  *at least with this kernel level (3.10.0-210.ael7a)*.

  I'm starting to suspect it's an issue with the kernel in Ubuntu
  (3.16...)

  == Comment: #21 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 15:31:41 ==

  Running at constant nominal frequency will help you eliminate turbo
  mode variation and focus on the Linux issues and root-cause faster.

  The behavior I described above is not a bug or problem in firmware.
  It is the expected and correct behavior where throttling can happen.
  I am only trying to help you to reduce the number of variables that is
  affecting this experiment.

  --Vaidy

  == Comment: #22 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 15:35:45 ==
  (In reply to comment #20)

  This is good input.  The other distro does not have fast-sleep
  support. We will have only snooze and nap.  On the Ubuntu system do
  you see /sys/devices/system/cpu/cpu*/cpuidle/state2/name ?

  Disabling fast-sleep state if present in your Ubuntu setup could help
  us to the next step.

  --Vaidy

  == Comment: #23 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:30:28 ==
  On the different distro LE system provided by Paul Clarke, the observed behavior is different than what I have seen on Ubuntu LE systems, but one of the tests ... the MPI-enabled simple loop ... shows huge timing variations core-to-core for nearly every job.  That system has 24 cores in smt8 mode

  ppc64_cpu --frequency
  Power Savings Mode: Dynamic, Favor Performance
  min:    3.961 GHz (cpu 175)
  max:    3.963 GHz (cpu 1)
  avg:    3.962 GHz

  and nearly every job provides output that looks like this :
  out.10:tmin = 3.757, tmax = 6.519 on rank 17, tavg = 5.126

  meaning that it takes anywhere from 3.757 to 6.519 seconds to get
  through the timed loop :

     MPI_Barrier(MPI_COMM_WORLD);
     t1 = MPI_Wtime();
     sum = 0.0;
     for (i=0; i<2000000000; i++) sum += ((double) (i%10));
     t2 = MPI_Wtime();
     elapsed = t2 - t1;

  There are no loads or stores in that loop ... there is a separate
  process bound to each core, and they work independently.  Additional
  instrumentation shows that the slow processes are in the run queue the
  whole time.

  So far, the other work loads that I have tried on the different distro
  LE system showed significantly lower timing variations than what I had
  recorded on Ubuntu LE ... but not this one.

  == Comment: #24 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:54:07 ==
  Just adding that on the same different distro LE system, after turning off SMT via the command : ppc64_cpu --smt=1, all instances of the simple loop test have outputs like this :

  tmin = 3.756, tmax = 3.757 on rank 5, tavg = 3.757

  in other words it takes the same time to complete the work in the loop
  on every core ... every time,  within the limits of what I have had
  the patience to check.

  == Comment: #25 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:03:16 ==
  Bob, the use of ST mode reduces variation on Ubuntu 14.04.2 as well.

  With SMT8 on another distro LE, I wonder whether
  "cpupower idle-set -d 0" helps reduce variation for the MPI-enabled
  simple loop?

  Is it correct to say that both Ubuntu LE 14.04.2 (kernel 3.16.0) and
  another distro LE (kernel) exhibit variation?

  Vaidy, Ubuntu 14.04.2 does not have cpuidle/state2 (fastsleep state).

  == Comment: #26 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 17:11:42 ==
  I ran the command :

  [root@tuleta ~]# cpupower idle-set -d 0
  Idlestate 0 disabled on CPU 0
  Idlestate 0 disabled on CPU 1
  ...

  on the different distro LE system after setting the state back to
  smt8, and the timing variability is still there :

  out.2:tmin = 3.757, tmax = 9.010 on rank 4, tavg = 4.619
  out.3:tmin = 3.757, tmax = 11.518 on rank 2, tavg = 4.684
  out.4:tmin = 3.757, tmax = 9.398 on rank 3, tavg = 4.773

  Essentially every job is showing truly huge timing variations.
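
To compare many such jobs, the result lines can be parsed and reduced to a tmax/tmin ratio per job. A small sketch, assuming every line follows the `tmin = ..., tmax = ... on rank N, tavg = ...` format quoted in this thread; the names `LINE_RE` and `parse_timing` are hypothetical.

```python
import re

LINE_RE = re.compile(
    r"tmin = (?P<tmin>[\d.]+), tmax = (?P<tmax>[\d.]+) "
    r"on rank (?P<rank>\d+), tavg = (?P<tavg>[\d.]+)"
)

def parse_timing(line):
    """Extract tmin, tmax, rank and tavg from one result line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "tmin": float(m["tmin"]),
        "tmax": float(m["tmax"]),
        "rank": int(m["rank"]),
        "tavg": float(m["tavg"]),
    }

# The three result lines quoted in this comment.
lines = [
    "out.2:tmin = 3.757, tmax = 9.010 on rank 4, tavg = 4.619",
    "out.3:tmin = 3.757, tmax = 11.518 on rank 2, tavg = 4.684",
    "out.4:tmin = 3.757, tmax = 9.398 on rank 3, tavg = 4.773",
]
for line in lines:
    t = parse_timing(line)
    print(f"rank {t['rank']}: tmax/tmin = {t['tmax'] / t['tmin']:.2f}")
```

A ratio close to 1.0 corresponds to the uniform behaviour seen with SMT off; the lines above are all well over 2.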

  == Comment: #27 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:24:46 ==
  Does it make any difference with "cpupower idle-set -d 1"? to disable Nap too?

  I think we only have snooze and Nap on LE.

  == Comment: #28 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 17:46:14 ==
  (In reply to comment #27)

  I have a p8 box running ubuntu 14.10 and I do see
  cat /sys/devices/system/cpu/cpu0/cpuidle/state2/name
  FastSleep

  == Comment: #29 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-04-23 06:01:57 ==
  I see that there are hotplug operations being carried out simultaneously with running the benchmark. If so, the performance degradation could be due to the tasks being not allowed to run on the freshly onlined cpus.

  I would suggest booting the system with all hardware threads and not doing
  hotplug operations in order to keep the above issue away while
  verifying the performance of the benchmarks, if the intention is to
  profile the cpufreq governors.

  Regards
  Preeti U Murthy

  == Comment: #31 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-28 00:27:52 ==
  On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.

  Are the enabling and disabling of these two states independent?

  == Comment: #32 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-28 16:16:23 ==
  Adding an observation on ubuntu le systems, using the simple-loop example above and the userspace governor (chosen so that one can set the frequency to a desired value).  When  using one thread per core with the system in SMT8 state, the time for the loop varies from ~3.7 sec to over 8 sec.  However, if a lot of iterations (10-20) of the same loop are done before starting the timed section of the code (adding a warmup loop), the variations in the timed section are dramatically reduced.  There are still some outliers, but a much smaller number of them; and the timing spread is a fraction of one second, instead of several seconds.  So there is a clear dependence on history, with the largest timing variations occurring immediately after job startup.  I should mention that this remains a problem for many performance benchmarks in the HPC area, which often run in a total time of less than one minute.  I would hope that with the userspace governor, or the performance governor, the power and frequency settings would remain constant.  Can someone confirm that?

  == Comment: #33 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-29 17:16:58 ==
  Vaidy, would you help answer my question on Comment 31?

  == Comment: #34 - George A. Chochia <chochia@xxxxxxxxxx> - 2015-05-13 11:52:53 ==
  Vaidy, I am currently seeing a 2.5x performance degradation in the Message Rate benchmark on p8, Ubuntu 14.04.02 LE.

  Performance was normal back in February, when we had 14.04.01 and
  older FW.

  The degradation goes away once snooze state is disabled. There have
  been two FW updates: 1/13 and 2/17.

  == Comment: #35 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:35:37 ==
  (In reply to comment #31)
  > On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.
  >
  > Are the enabling and disabling of these two states independent?

  Hi Peter,

  Yes, the enable/disable for idle states are independent.  At least one
  idle state is expected to be enabled, if not the CPU may busy loop at
  idle and not reduce the thread priority like snooze.

  You can disable snooze and have nap enabled or the other way, but
  having both disabled will lead to idle threads burning more cycles.

  --Vaidy

  == Comment: #36 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:58:07 ==
  (In reply to comment #34)

  Hi George,

  The idle state management code is same for both the kernels.  You have
  only snooze and nap as idle states right?

  As I explained over email, when snooze and nap are enabled, the
  cpuidle logic should choose nap for idle sibling threads after a short
  period in snooze.

  Can you guys analyse and confirm the following points:

  * The workload always runs on the primary thread of each core
  * The remaining 7 sibling threads should be in nap (state1)
  * Time spent in the 'nap' state for each of the sibling threads can be obtained from sysfs
  /sys/devices/system/cpu/cpuN/cpuidle/state1/time (unit is microseconds)
  * Workload variation is related to nap residency of sibling threads on that core

  If the nap residency (time spent in nap) is not uniform then workload
  performance would be proportionally non uniform.
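
One way to check the nap residency of sibling threads is to snapshot the sysfs `time` counters before and after a run and compare the deltas. A sketch under the assumption that the counters are cumulative microseconds, as stated above; the function names and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path

def nap_residency_us(cpus, state=1, sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu: cumulative time spent in the idle state, in microseconds}."""
    result = {}
    for cpu in cpus:
        time_file = (
            Path(sysfs_root) / f"cpu{cpu}" / "cpuidle" / f"state{state}" / "time"
        )
        result[cpu] = int(time_file.read_text())
    return result

def residency_delta(before, after):
    """Per-CPU residency accumulated between two snapshots."""
    return {cpu: after[cpu] - before[cpu] for cpu in before}

# Intended use on the test system (not executed here):
#   siblings = range(1, 8)              # threads 1-7 of core 0 in SMT8
#   before = nap_residency_us(siblings)
#   ... run the workload on cpu0 ...
#   after = nap_residency_us(siblings)
#   print(residency_delta(before, after))
```

If the deltas differ widely between siblings of a slow core and a fast core, that would support the residency theory above.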

  The above statement (if proven) is one possible root-cause, that can
  help us move forward and design a fix.

  --Vaidy

  == Comment: #37 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-13 17:45:33 ==
  Hi Vaidy,

  Let's use Bob's serial_loop.c as an example. There are 24 copies of
  his program running on 24 cores in parallel. Only the primary threads
  of the cores are used.

  Did Shilpa use Bob's program to re-create the problem and find out
  that some unused sibling threads do not sleep fast enough and take
  away cycles from the primary thread to cause variability?

  It is great to know that we can study the sleep time by examining the
  /sys/devices/system/cpu/cpuN/cpuidle/state1/time. Did Shilpa use this
  method to come up with the above understanding?

  Based on George's finding, do you know whether there are thermal code
  changes in the old firmware that affects the thermal behavior in the
  current version?

  Thanks,
  Peter

  == Comment: #38 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-05-13 23:24:18 ==
  Is this really related to snooze? Jenifer mentioned in Comment 10 that disabling nap and not snooze also reduced the variance. Can you please confirm if this is the case? This will help us narrow down the issue.

  Regards
  Preeti U Murthy

  == Comment: #39 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-14 10:19:09 ==
  (In reply to comment #38)
  Hi Preeti, sorry I corrected myself in comment 11, I was disabling state0 which is snooze, not nap:
  # cpupower idle-set -d 0
  # cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
  snooze

  Still might be interesting to try some tests w/ nap disabled.

  == Comment: #40 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-14 11:15:45 ==
  (In reply to comment #37)
  Yes . I also used perf-trace events to get the same info.

  Regards,
  Shilpa

  == Comment: #42 - Anton Blanchard <antonb@xxxxxxxxxxx> - 2015-05-19 19:40:45 ==
  If I am reading that trace right, we spent over 200ms in snooze on a secondary thread of a badly performing core. That is an enormous amount of time to be chewing up the core.

  == Comment: #43 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-19 21:45:20 ==
  Vaidy,

  Could you provide more information on your proposed solution which is
  in the kernel, not in OPAL?

  Does it mean that you need to upstream different patches to set of
  kernels for Ubuntu and other distro?

  Peter

  == Comment: #44 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 10:56:48 ==
  (In reply to comment #42)
  Hi Anton,

  That is right, exit from snooze state is the problem.  In the proposed
  fix Shilpa has added a forced exit from snooze loop after the target
  residency so that the cpuidle governor can select nap.

  We have to rewrite the snooze loop and exit after the first interrupt
  or timer or after the target residency (100us) so that the idle
  state promotion can happen.
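
The idea of a bounded snooze loop can be illustrated with a user-space toy model. This is not the kernel patch and does not use any kernel API; the function `snooze`, its parameters, and the return values are hypothetical, and only the 100us target residency comes from this comment.

```python
import time

def snooze(work_pending, target_residency_us=100, clock=time.monotonic):
    """Spin until work arrives or the target residency has elapsed.

    Returns "work" if work_pending() became true, or "timeout" if the
    loop gave up so that a deeper idle state could be selected instead.
    """
    deadline = clock() + target_residency_us / 1_000_000
    while not work_pending():
        if clock() >= deadline:
            # Forced exit: without this check the loop could spin indefinitely.
            return "timeout"
    return "work"

# With no work arriving, the loop exits after roughly 100 microseconds.
print(snooze(lambda: False))
```

The design point is the deadline check: without it, an idle thread with nothing to wake it keeps spinning in the shallow state and takes cycles from its busy sibling.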

  --Vaidy

  == Comment: #45 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-20 11:02:06 ==
   Hi,

  I am sharing the link for ubuntu kernel packages with the fix:

  1) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-14-04.tar
      This file contains the following packages:
      a)linux-headers-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
      b)linux-image-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
      c)linux-image-extra-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
      d)linux-tools-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
      The fix is based on top of ubuntu-14.04.02 3.16.0-38-generic + upstream commit (92c83ff5b42b  cpuidle: powernv: Read target_residency value of idle states from DT if available)

  2) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-15.04.tar
      This file contains the following packages:
      linux-headers-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
      linux-image-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
      linux-image-extra-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
      linux-tools-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
      The fix is based on top of ubuntu-15.04 3.19.0-17-generic

  == Comment: #46 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 11:21:07 ==
  (In reply to comment #43)

  Hi Peter,

  Sure.  As per our discussion yesterday, we agreed on the following:

  * The issue is not machine specific, the problem was recreated by
  Jenifer on S822L also even though other teams believe the issue is
  S824L specific.

  * The key issue observed is the sibling thread's snooze time variation
  which chews cycles from primary thread.

  * The fix is to force exit snooze loop after target residency (100us)
  and allow the cpuidle governor to enter nap.

  * This fix is completely in Linux kernel cpuidle driver code and does
  not require change in OPAL.

  Yes, once we verify the solution, we will design the correct idle
  state auto-promotion logic in cpuidle driver and get it upstream and
  then push to the other distro and ubuntu distros that run bare-metal.

  --Vaidy

  == Comment: #47 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-20 12:44:17 ==
  I tested Shilpa's kernel packages w/ the fix and can confirm I no longer see the variation issue w/ the serial loop program running on primary threads in SMT8 mode when the performance governor is set.   I will get with Peter to test with another benchmark that previously hit the variation issue.

  ----

  System:
  8247-42L
  20 cores, SMT8
  FW830_041
  Ubuntu 15.04

  Run script:
  #!/bin/bash

  for iter in `seq 1 100`
  do
    for cpu in 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152
    do
    taskset -c ${cpu} ./serial_loop > out.${cpu}.${iter} &
    done
    echo $iter
    wait
  done
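
The CPU list in the run script (0, 8, 16, ... 152) is the first hardware thread of each of the 20 cores in SMT8 mode. A sketch that generates such a list for other core counts, assuming logical CPUs are numbered contiguously per core as they are on this system; the helper name `primary_threads` is hypothetical.

```python
def primary_threads(cores, smt=8):
    """Logical CPU ids of the first hardware thread of each core."""
    return list(range(0, cores * smt, smt))

# Space-separated list suitable for pasting into the inner "for cpu in" loop.
print(" ".join(str(cpu) for cpu in primary_threads(20)))
```

For a 24-core system the same call with `cores=24` would extend the list to 184.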

  Results:

  -- 3.19.0-17 fix --

  Performance
  -----------
  Loop elapsed:		User time:
  Min	Max		Min	Max
  3.885	3.92		3.877	3.914
  3.885	3.892		3.877	3.886
  3.885	3.908		3.877	3.901

  Ondemand
  --------
  Loop elapsed:		User time:
  Min	Max		Min	Max
  3.933	3.949		3.901	3.912

  -- orig 3.19.0-16 kernel --

  Performance
  -----------
  Loop elapsed:		User time:
  Min	Max		Min	Max
  3.886	4.507		3.88	4.498
  3.884	10.404		3.877	10.39

  Ondemand
  --------
  Loop elapsed:		User time:
  Min	Max		Min	Max
  3.932	3.994		3.901	3.959

  == Comment: #49 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-21 18:59:33 ==
  The fix from comment #45 also resolves large variance issues w/ STREAM and DGEMM workloads. Results listed below.

  =========================================
  STREAM:

  MB/sec
  SMT8, 1 thread per core, 100 loop

  -------- orig 3.19.0-16 kernel --------

  Performance:
  ____________
   Min		Max		%diff
  run1:	304384.6341	308199.3341	1.25%
  run2: 	150096.0562	308516.5557	69.09%

  Performance
  + disable snooze:
  _________________
   Min		Max		%diff
  run1:	305700.3257	308403.9185	0.88%
  run2: 	305547.2215	308771.2772	1.05%

  Ondemand:
  _________
   Min		Max		%diff
  run1:	298386.1295	302209.7456	1.27%

  ----------- 3.19.0-17 fix -----------

  Performance:
  ____________
   Min		Max		%diff
  run1:	303486.8368	308433.0545	1.62%
  run2: 	304768.6159	308410.2177	1.19%
  run3:	304723.2556	308847.065	1.34%

  Ondemand:
  _________
   Min		Max		%diff
  run1:	297364.385	302473.0888	1.70%

  =========================================

  =========================================
  DGEMM:

  GFlops
  SMT8, 1 thread per core, 20 loop

  -------- orig 3.19.0-16 kernel --------

  Performance:
  ____________
   Min		Max		%diff
  run1:	479.53		520.2		8.14%

  Performance
  + disable snooze:
  _________________
   Min		Max		%diff
  run1:	511.18		520.49		1.80%

  Ondemand:
  _________
   Min		Max		%diff
  run1:	505.64		509.88		0.84%

  ----------- 3.19.0-17 fix -----------

  Performance:
  ____________
   Min		Max		%diff
  run1:	512.77		520.84		1.56%
  run2: 	517.19		520.34		0.61%
  run3:	517.93		520.35		0.47%

  Ondemand:
  _________
   Min		Max		%diff
  run1:	505.72		508.53		0.55%
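
The %diff column in the STREAM and DGEMM tables is not defined in the comment. The reported figures appear consistent with the difference between Max and Min taken as a percentage of their mean; that reading is an inference from the numbers, and the function name `pct_diff` is hypothetical.

```python
def pct_diff(lo, hi):
    """Difference between two values as a percentage of their mean."""
    return (hi - lo) / ((hi + lo) / 2) * 100

# STREAM run2 on the original kernel, reported above as 69.09%.
print(f"{pct_diff(150096.0562, 308516.5557):.2f}%")
# DGEMM run1 on the original kernel, reported above as 8.14%.
print(f"{pct_diff(479.53, 520.2):.2f}%")
```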

  == Comment: #51 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-06-14 22:53:05 ==
  Vaidy, is this fix being reviewed by the Linux kernel community? Can you give some estimates as to when this kernel fix will get into mainline and also when it will get into Ubuntu distro?

  == Comment: #52 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-06-24 07:18:28 ==
  The patch can be found in the upstream kernel 4.2
  78eaa10f027c cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1470404/+subscriptions