kernel-packages team mailing list archive
Message #127666
[Bug 1470404] Re: Some workloads experience more measurement variation with scaling_governor=performance than ondemand
This bug was fixed in the package linux - 3.16.0-44.59
---------------
linux (3.16.0-44.59) utopic; urgency=low
[ Brad Figg ]
* Release Tracking Bug
- LP: #1472030
[ Iyappan Subramanian ]
* SAUCE: (no-up) drivers: net: xgene: fix: Out of order descriptor bytes
read
- LP: #1425576
[ Upstream Kernel Changes ]
* Revert "tools/vm: fix page-flags build"
- LP: #1471170
* NVMe: Add shutdown timeout as module parameter.
- LP: #1465136
* Drivers: hv: vmbus: Add support for VMBus panic notifier handler
- LP: #1463584
* Drivers: hv: vmbus: Correcting truncation error for constant
HV_CRASH_CTL_CRASH_NOTIFY
- LP: #1463584
* KVM: nVMX: fix lifetime issues for vmcs02
- LP: #1448269
* KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
- LP: #1448269
* mm/slab_common: support the slub_debug boot option on specific object
size
- LP: #1456952
* kvm: x86: fix kvm_apic_has_events to check for NULL pointer
* cpuidle: powernv: Populate cpuidle state details by querying the
device-tree
- LP: #1470404
* cpuidle: powernv: Read target_residency value of idle states from DT if
available
- LP: #1470404
* cpuidle: powernv: Avoid endianness conversions while parsing DT
- LP: #1470404
* cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
- LP: #1470404
* iio: adis16400: Report pressure channel scale
- LP: #1471170
* iio: adis16400: Use != channel indices for the two voltage channels
- LP: #1471170
* iio: adis16400: Compute the scan mask from channel indices
- LP: #1471170
* iio: adis16400: Remove unused variable
- LP: #1471170
* iio: adis16400: Fix burst mode
- LP: #1471170
* iio: adis16400: Fix burst transfer for adis16448
- LP: #1471170
* USB: serial: ftdi_sio: Add support for a Motion Tracker Development
Board
- LP: #1471170
* iio: adc: twl6030-gpadc: Fix modalias
- LP: #1471170
* serial: imx: Fix DMA handling for IDLE condition aborts
- LP: #1471170
* usb: dwc3: gadget: Fix incorrect DEPCMD and DGCMD status macros
- LP: #1471170
* ALSA: usb-audio: Add mic volume fix quirk for Logitech Quickcam Fusion
- LP: #1471170
* n_tty: Fix auditing support for cannonical mode
- LP: #1471170
* drm/i915/hsw: Fix workaround for server AUX channel clock divisor
- LP: #1471170
* x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
- LP: #1471170
* lib: Fix strnlen_user() to not touch memory after specified maximum
- LP: #1471170
* Input: elantech - fix detection of touchpads where the revision matches
a known rate
- LP: #1471170
* ALSA: hda/realtek - Add a fixup for another Acer Aspire 9420
- LP: #1471170
* ALSA: usb-audio: add MAYA44 USB+ mixer control names
- LP: #1471170
* ALSA: usb-audio: fix missing input volume controls in MAYA44 USB(+)
- LP: #1471170
* USB: cp210x: add ID for HubZ dual ZigBee and Z-Wave dongle
- LP: #1471170
* Input: elantech - add new icbody type
- LP: #1471170
* MIPS: Fix enabling of DEBUG_STACKOVERFLOW
- LP: #1471170
* xfrm: fix a race in xfrm_state_lookup_byspi
- LP: #1471170
* kconfig: Fix warning "‘jump’ may be used uninitialized"
- LP: #1471170
* scripts/sortextable: suppress warning: `relocs_size' may be used
uninitialized
- LP: #1471170
* thermal: step_wise: Revert optimization
- LP: #1471170
* MIPS: KVM: Do not sign extend on unsigned MMIO load
- LP: #1471170
* arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
- LP: #1471170
* net: core: Correct an over-stringent device loop detection.
- LP: #1471170
* net: phy: Allow EEE for all RGMII variants
- LP: #1471170
* net: dp83640: fix broken calibration routine.
- LP: #1471170
* net: dp83640: reinforce locking rules.
- LP: #1471170
* unix/caif: sk_socket can disappear when state is unlocked
- LP: #1471170
* xen/netback: Properly initialize credit_bytes
- LP: #1471170
* udp: fix behavior of wrong checksums
- LP: #1471170
* xen: netback: read hotplug script once at start of day.
- LP: #1471170
* ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()
- LP: #1471170
* bridge: disable softirqs around br_fdb_update to avoid lockup
- LP: #1471170
* drm/i915: Assume dual channel LVDS if pixel clock necessitates it
- LP: #1471170
* Btrfs: send, add missing check for dead clone root
- LP: #1471170
* Btrfs: send, don't leave without decrementing clone root's
send_progress
- LP: #1471170
* btrfs: incorrect handling for fiemap_fill_next_extent return
- LP: #1471170
* btrfs: cleanup orphans while looking up default subvolume
- LP: #1471170
* iommu/vt-d: Allow RMRR on graphics devices too
- LP: #1471170
* iommu/vt-d: Fix passthrough mode with translation-disabled devices
- LP: #1471170
* ata: ahci_mvebu: Fix wrongly set base address for the MBus window
setting
- LP: #1471170
* virtio_pci: Clear stale cpumask when setting irq affinity
- LP: #1471170
* irqchip: sunxi-nmi: Fix off-by-one error in irq iterator
- LP: #1471170
* pata_octeon_cf: fix broken build
- LP: #1471170
* Input: synaptics - add min/max quirk for Lenovo S540
- LP: #1471170
* drm/i915: Fix DDC probe for passive adapters
- LP: #1471170
* cfg80211: wext: clear sinfo struct before calling driver
- LP: #1471170
* mm/memory_hotplug.c: set zone->wait_table to null after freeing it
- LP: #1471170
* ring-buffer-benchmark: Fix the wrong sched_priority of producer
- LP: #1471170
* block: fix ext_dev_lock lockdep report
- LP: #1471170
* iser-target: Fix variable-length response error completion
- LP: #1471170
* iser-target: release stale iser connections
- LP: #1471170
* ALSA: hda - adding a DAC/pin preference map for a HP Envy TS machine
- LP: #1471170
* drm/mgag200: Reject non-character-cell-aligned mode widths
- LP: #1471170
* crypto: caam - fix uninitialized state->buf_dma field
- LP: #1471170
* crypto: caam - improve initalization for context state saves
- LP: #1471170
* crypto: caam - fix RNG buffer cache alignment
- LP: #1471170
* tracing: Have filter check for balanced ops
- LP: #1471170
* drm/radeon: fix freeze for laptop with Turks/Thames GPU.
- LP: #1471170
* Linux 3.16.7-ckt14
- LP: #1471170
-- Brad Figg <brad.figg@xxxxxxxxxxxxx> Mon, 06 Jul 2015 17:48:28 -0700
--
https://bugs.launchpad.net/bugs/1470404
Title:
Some workloads experience more measurement variation with
scaling_governor=performance than ondemand
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Utopic:
Fix Released
Status in linux source package in Vivid:
Fix Released
Bug description:
SRU Justification:
[Impact]
Certain workloads can exhibit a large variance in behavior due to how cpus are idled on power8 systems.
[Fix]
For 3.16:
74aa51b5ccd3975392e30d11820dc073c5f2cd32
92c83ff5b42b109c94fdeee53cb31f674f776d75
70734a786acfd1998e47d40df19cba5c29469bdf
For 3.16, 3.19:
78eaa10f027cf69f9bd409e64eaff902172b2327
$ git describe 78eaa10f027cf69f9bd409e64eaff902172b2327
v4.1-rc2-9-g78eaa10
Once we rebase to something v4.1+ we'll have this fixed in Wily.
[Test Case]
Set the system with the SMT8 mode and scaling_governor=performance or ondemand.
Run the workload 100 times.
--
== Comment: #0 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-15 21:30:31 ==
---Problem Description---
Many workloads experience wide measurement variation, more with scaling_governor=performance than ondemand.
Contact Information = wpeter@xxxxxxxxxx, farid@xxxxxxxxxx
---uname output---
Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = 20-core and 24-core Tuleta systems
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Set the system with the SMT8 mode and scaling_governor=performance or ondemand.
Run the workload 100 times.
Get 100 data points and sort them.
Compare the spread of results with two governor modes.
The source and scripts to run a simple test case will be provided.
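A minimal sketch of the "get 100 data points and sort them" step, assuming the elapsed times have already been extracted into a file with one number per line. The function name summarize_times and that file layout are assumptions for illustration, not part of the test case referred to here.

```shell
# summarize_times FILE
# Sort the timings numerically and print count, min, median, max and spread.
# Assumes one elapsed time (in seconds) per line; blank lines are skipped.
summarize_times() {
    LC_ALL=C sort -n "$1" | awk '
        NF { v[++n] = $1 }
        END {
            if (n == 0) { print "no data"; exit 1 }
            median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
            printf "n=%d min=%.3f median=%.3f max=%.3f spread=%.3f\n",
                   n, v[1], median, v[n], v[n] - v[1]
        }'
}
```

Running it once per governor setting and comparing the "spread" values gives the comparison described in the steps.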
Stack trace output:
no
Oops output:
no
Userspace tool common name: not sure what it is.
Userspace rpm: ??
The userspace tool has the following bit modes: These are 64-bit
programs.
System Dump Info:
The system is not configured to capture a system dump.
Userspace tool obtained from project website: na
*Additional Instructions for wpeter@xxxxxxxxxx, farid@xxxxxxxxxx:
-Attach sysctl -a output to the bug.
-Attach ltrace and strace of userspace application.
== Comment: #2 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:47:41 ==
This problem has a number of variables we were trying to reduce:
- endianness
- operating system
- kernel level
- compiler
Bob Walkup says he's seen the variability in a bunch of CPU-intensive
test cases, in various languages, using various compilers, which would
seem to eliminate the "compiler" variable.
We had not looked at the performance governor setting to this point.
Interesting results, and yet another variable to add to the above mix.
Perhaps two more runs? (LE-ondemand, LE-performance, BE-ondemand,
BE-performance)
== Comment: #3 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:50:09 ==
Also, Bob says he can reproduce this with and without vectorization (the stalls move from the VSU to the FPU), and with and without floating point (the stalls move from the FPU to the FXU). Very odd.
== Comment: #4 - Andrea M. Davis <amdavis@xxxxxxxxxx> - 2015-04-16 10:10:01 ==
Peter, what version of Ubuntu are you running?
== Comment: #5 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:32:58 ==
Andrea,
Ubuntu 14.04.2 LTS.
#uname -a
Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
#lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
== Comment: #6 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:50:11 ==
There are a few more things we have tried.
(1) For STREAM, it was originally compiled with gfortran and its
corresponding OpenMP. I compiled it with xlf and its corresponding
OpenMP. There is no difference in performance.
(2) There was a concern about NUMA, meaning is it possible the CPU
binding by OpenMP is incorrect so that there are remote memory
accesses behind the scenes? By disabling one DCM and using 10 or 12
cores only in the other DCM, we can still see occasional drops in
performance, although not often. We can conclude it is not due to
NUMA.
(3) Farid and I also tried out different scheduler parameters
(sched_min_granularity_ns, sched_wakeup_granularity_ns,
sched_latency_ns, and others) and matched the other distro's
corresponding values, but did not see performance changes.
(4) For the workload AMG2006, the use of scaling_governor=ondemand
also reduces the spread of data significantly.
(5) For all the above investigations, I used a 20-core Tuleta and a
24-core Tuleta, although they are configured identically with Ubuntu
14.04.2. That is, the two systems paint a consistent picture.
So far, we looked at compiler, NUMA, scheduler, memory test, CPU test,
ST vs SMT, etc. There is a significant difference in variation between
scaling_governor=performance and scaling_governor=ondemand with the
same application and system configurations.
Hopefully, the data point us in the right direction, i.e., there could
be some unexpected behaviour with the implementation of
scaling_governor=performance.
== Comment: #7 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 14:30:21 ==
Note that Bob Walkup does not see the improvement using scaling_governor=ondemand on a borrowed POK lab system. However, he still suggested that I open a bug based on my findings. I guess he is not totally sure about the system he got.
It would be good to have data independently collected by others to
verify my observations.
Bob's serial_loop.c program can be compiled and run very easily. The
examination of data is straightforward too.
== Comment: #10 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-17 16:33:38 ==
I was able to reproduce the problem with the serial_loop test described in comment 1 (my system is Ubuntu 15.04); however, disabling the nap cpuidle state seemed to resolve the variance:
cpupower idle-set -d 0
Can others reproduce? I am not sure why nap behavior would be any
different w/ the performance governor though. Note, to re-enable:
cpupower idle-set -E
== Comment: #11 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 13:09:34 ==
(In reply to comment #10)
> disabling the nap cpuidle
> state seemed to resolve the variance:
>
> cpupower idle-set -d 0
just want to clarify state0 is actually snooze, not nap:
# cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
snooze
# cat /sys/devices/system/cpu/cpu0/cpuidle/state1/name
Nap
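To avoid mixing up state numbers and names again, all idle states of a CPU can be listed in one go. This is a small sketch built on the same sysfs files shown above; the function name list_idle_states and the optional root argument (used only so the function can be pointed at a test directory) are assumptions.

```shell
# list_idle_states [CPU] [SYSFS_ROOT]
# Print "stateN <name> disable=<flag>" for every cpuidle state of one CPU.
# SYSFS_ROOT defaults to the real sysfs location.
list_idle_states() {
    local cpu="${1:-0}"
    local root="${2:-/sys/devices/system/cpu}"
    local dir
    for dir in "$root/cpu$cpu"/cpuidle/state*; do
        [ -r "$dir/name" ] || continue
        printf '%s %s disable=%s\n' "${dir##*/}" \
            "$(cat "$dir/name")" "$(cat "$dir/disable" 2>/dev/null)"
    done
}
```

Calling list_idle_states 0 on the system above should show state0 as snooze and state1 as Nap, each with its current disable flag.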
== Comment: #12 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-20 16:26:32 ==
Jenifer, thanks for the suggestion.
"cpupower idle-set -d 0" works for Bob's serial_loop.c program.
There are 24 identical processes running serial_loop in parallel, each
bound to one core. With 100 iterations, there are 2400 elapsed times
collected for each run. Each elapsed time over 5 seconds is counted as
an outlier.
The following data were collected on a 24-core Tuleta system.
Scaling_governor = P(erformance) or O(ndemand)
snooze (state0) = default (enabled) and disabled
P and default = 34-35 outliers
P and snooze disabled = 0 outliers
O and default = 2-4 outliers
O and snooze disabled = 0 outliers
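The outlier counts above can be reproduced with a short helper, sketched here under the assumption that each file holds one elapsed time per line. The name count_outliers and the file layout are hypothetical; the 5-second threshold is the one used for the numbers above.

```shell
# count_outliers THRESHOLD FILE...
# Count the elapsed times (one per line) that exceed THRESHOLD seconds.
count_outliers() {
    local threshold="$1"
    shift
    awk -v limit="$threshold" \
        'NF && ($1 + 0) > (limit + 0) { n++ } END { print n + 0 }' "$@"
}
```

For example, count_outliers 5 times.txt prints the number of runs that took longer than 5 seconds.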
As you asked, why do we need to disable snooze in order to reduce
measurement variation when scaling_governor=performance?
== Comment: #13 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 16:40:46 ==
Vaidy, could your team comment on this? In SMT8 mode, more measurement variation is seen using the performance governor compared to the ondemand governor when snooze is enabled, but disabling snooze seems to resolve the problem. Does it make sense that snooze impacts would be higher in performance mode?
Stewart mentioned some latency improvements in the new 830 OPAL
firmware, is that related to this type of sleep state wakeup?
== Comment: #14 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 12:23:01 ==
"cpupower idle-set -d 0" also fixes the measurement variation of STREAM on a 24-core Tuleta system.
scaling_governor=performance and default snooze = 65 outliers out of
400 runs.
scaling_governor=performance and snooze disabled = 0 outliers out of
400 runs.
== Comment: #15 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 23:21:22 ==
"cpupower idle-set -d 0" also fixes the measurement variation of AMG2006 on a 24-core Tuleta system.
It means when scaling_governor=performance, disabling snooze (state0,
shallow sleep) while still enabling Nap (state1, deep sleep) can
stabilize measurements.
Vaidy, please help understand this behaviour.
== Comment: #17 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 14:22:11 ==
Hi Team,
Interesting observation. Let me give possible contributing factors:
(a) When running on ondemand, cpu frequency changed from min to max including turbo frequencies.
(b) When running performance governor, frequency is set to constantly run turbo.
Based on temperature, CPU may not be able to sustain turbo since we
are constantly running at the frequency and burning more power. The
variation could actually come from the fact that the platform (OCC)
could drop the frequency periodically due to over temperature.
While running ondemand, turning down the power could help sustain the turbo frequency longer.
Disabling snooze will further increase the power consumption and push for more variation at turbo frequency.
Our systems are designed to run consistently at nominal frequency and
hence I would suggest that you run your experiment by setting nominal
frequency to all cores using performance governor+max limit or
userspace governor.
You could use "Throughput-performance profile" using tuned-adm for
this purpose.
If running in "Nominal" Frequency gives you consistent performance,
then the above theory of turbo mode variation holds good. We can
confirm them with additional traces in cpufreq back-end driver code.
We are currently improving our instrumentation to detect frequency
variation and throttling. This is a good scenario to validate our
trace design as well.
Let me know what you find.
--Vaidy
== Comment: #18 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-22 14:28:15 ==
(In reply to comment #17)
> Disabling snooze will further increase the power consumption and push for
> more variation at turbo frequency.
We actually see the opposite effect, disabling snooze makes the
variability at turbo freq go away :)
== Comment: #19 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 14:44:43 ==
Additionally, this is not a problem when running BE kernel, on the same P8 configuration box. I suspect
it is more to do with configuration settings on LE before we start pointing fingers at the FW codepath
when using Ubuntu LE.
== Comment: #20 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-22 15:23:43 ==
Bob is finding another distro LE does _not_ exhibit variation.
This would seem to eliminate LE as the culprit.
Looking at the settings of
/sys/devices/system/cpu/cpu*/cpuidle/state0/disable, they all report
"0", which I believe is the same as having "snooze" enabled, correct?
That would seem to eliminate "snooze" in and of itself as a culprit,
*at least with this kernel level (3.10.0-210.ael7a)*.
I'm starting to suspect it's an issue with the kernel in Ubuntu
(3.16...)
== Comment: #21 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 15:31:41 ==
Running at constant nominal frequency will help you eliminate turbo
mode variation and focus on the Linux issues and root-cause faster.
The behavior I described above is not a bug or problem in firmware.
It is the expected and correct behavior where throttling can happen.
I am only trying to help you to reduce the number of variables that are
affecting this experiment.
--Vaidy
== Comment: #22 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 15:35:45 ==
(In reply to comment #20)
This is good input. The other distro does not have fast-sleep
support. We will have only snooze and nap. On the Ubuntu system do
you see /sys/devices/system/cpu/cpu*/cpuidle/state2/name ?
Disabling fast-sleep state if present in your Ubuntu setup could help
us to the next step.
--Vaidy
== Comment: #23 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:30:28 ==
On the different distro LE system provided by Paul Clarke, the observed behavior is different than what I have seen on Ubuntu LE systems, but one of the tests ... the MPI-enabled simple loop ... shows huge timing variations core-to-core for nearly every job. That system has 24 cores in smt8 mode
ppc64_cpu --frequency
Power Savings Mode: Dynamic, Favor Performance
min: 3.961 GHz (cpu 175)
max: 3.963 GHz (cpu 1)
avg: 3.962 GHz
and nearly every job provides output that looks like this :
out.10:tmin = 3.757, tmax = 6.519 on rank 17, tavg = 5.126
meaning that it takes anywhere from 3.757 to 6.519 seconds to get
through the timed loop :
MPI_Barrier(MPI_COMM_WORLD);
t1 = MPI_Wtime();
sum = 0.0;
for (i=0; i<2000000000; i++) sum += ((double) (i%10));
t2 = MPI_Wtime();
elapsed = t2 - t1;
There are no loads or stores in that loop ... there is a separate
process bound to each core, and they work independently. Additional
instrumentation shows that the slow processes are in the run queue the
whole time.
So far, the other work loads that I have tried on the different distro
LE system showed significantly lower timing variations than what I had
recorded on Ubuntu LE ... but not this one.
== Comment: #24 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:54:07 ==
Just adding that on the same different distro LE system, after turning off SMT via the command : ppc64_cpu --smt=1, all instances of the simple loop test have outputs like this :
tmin = 3.756, tmax = 3.757 on rank 5, tavg = 3.757
in other words it takes the same time to complete the work in the loop
on every core ... every time, within the limits of what I have had
the patience to check.
== Comment: #25 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:03:16 ==
Bob, the use of ST mode reduces variation on Ubuntu 14.04.2 as well.
With SMT8 on another distro LE, I wonder whether
"cpupower idle-set -d 0" helps reduce variation for the MPI-enabled simple loop?
Is it correct to say that both Ubuntu LE 14.04.2 (kernel 3.16.0) and
another distro LE (kernel) exhibit variation?
Vaidy, Ubuntu 14.04.2 does not have cpuidle/state2 (fastsleep state).
== Comment: #26 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 17:11:42 ==
I ran the command :
[root@tuleta ~]# cpupower idle-set -d 0
Idlestate 0 disabled on CPU 0
Idlestate 0 disabled on CPU 1
...
on the different distro LE system after setting the state back to
smt8, and the timing variability is still there :
out.2:tmin = 3.757, tmax = 9.010 on rank 4, tavg = 4.619
out.3:tmin = 3.757, tmax = 11.518 on rank 2, tavg = 4.684
out.4:tmin = 3.757, tmax = 9.398 on rank 3, tavg = 4.773
Essentially every job is showing truly huge timing variations.
== Comment: #27 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:24:46 ==
Does it make any difference to also disable Nap with "cpupower idle-set -d 1"?
I think we only have snooze and Nap on LE.
== Comment: #28 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 17:46:14 ==
(In reply to comment #27)
I have a p8 box running ubuntu 14.10 and I do see
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/name
FastSleep
== Comment: #29 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-04-23 06:01:57 ==
I see that there are hotplug operations being carried out simultaneously with running the benchmark. If so, the performance degradation could be due to the tasks being not allowed to run on the freshly onlined cpus.
I would suggest booting the system with all hardware threads and not doing
hotplug operations, in order to keep the above issue away while
verifying the performance of the benchmarks, if the intention is to
profile the cpufreq governors.
Regards
Preeti U Murthy
== Comment: #31 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-28 00:27:52 ==
On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.
Are the enabling and disabling of these two states independent?
== Comment: #32 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-28 16:16:23 ==
Adding an observation on ubuntu le systems, using the simple-loop example above and the userspace governor (chosen so that one can set the frequency to a desired value). When using one thread per core with the system in SMT8 state, the time for the loop varies from ~3.7 sec to over 8 sec. However, if a lot of iterations (10-20) of the same loop are done before starting the timed section of the code (adding a warmup loop), the variations in the timed section are dramatically reduced. There are still some outliers, but a much smaller number of them; and the timing spread is a fraction of one second, instead of several seconds. So there is a clear dependence on history, with the largest timing variations occurring immediately after job startup. I should mention that this remains a problem for many performance benchmarks in the HPC area, which often run in a total time of less than one minute. I would hope that with the userspace governor, or the performance governor, the power and frequency settings would remain constant. Can someone confirm that?
== Comment: #33 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-29 17:16:58 ==
Vaidy, would you help answer my question on Comment 31?
== Comment: #34 - George A. Chochia <chochia@xxxxxxxxxx> - 2015-05-13 11:52:53 ==
Vaidy, I am currently seeing a 2.5x performance degradation in the Message Rate benchmark on p8, Ubuntu 14.04.02 LE.
Performance was normal back in February, when we had 14.04.01 and
older FW.
The degradation goes away once snooze state is disabled. There have
been two FW updates: 1/13 and 2/17.
== Comment: #35 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:35:37 ==
(In reply to comment #31)
> On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.
>
> Are the enabling and disabling of these two states independent?
Hi Peter,
Yes, the enable/disable controls for idle states are independent. At least 1
idle state is expected to be enabled, if not the CPU may busy loop at
idle and not reduce the thread priority like snooze.
You can disable snooze and have nap enabled or the other way, but
having both disabled will lead to idle threads burning more cycles.
--Vaidy
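Because the states are independent, each one can be toggled on its own through the per-state "disable" file in sysfs (the same file inspected elsewhere in this bug). The sketch below is an illustration only; the function name set_idle_state and the optional root argument are assumptions, and writing to the real files requires root.

```shell
# set_idle_state STATE FLAG [SYSFS_ROOT]
# Write FLAG (1 = disabled, 0 = enabled) to the disable file of idle state
# STATE on every CPU. SYSFS_ROOT defaults to the real sysfs location.
set_idle_state() {
    local state="$1"
    local flag="$2"
    local root="${3:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpuidle/state"$state"/disable; do
        [ -w "$f" ] || continue
        echo "$flag" > "$f"
    done
}
```

For example, set_idle_state 0 1 disables snooze on all CPUs while leaving nap untouched, and set_idle_state 0 0 re-enables it.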
== Comment: #36 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:58:07 ==
(In reply to comment #34)
Hi George,
The idle state management code is same for both the kernels. You have
only snooze and nap as idle states right?
As I explained over email, when snooze and nap are enabled, the
cpuidle logic should choose nap for idle sibling threads after a short
period in snooze.
Can you guys analyse and confirm the following points:
* The workload always runs on the primary thread of each core
* The remaining 7 sibling threads should be in nap (state1)
* Time spent in the 'nap' state for each of the sibling threads can be obtained from sysfs
/sys/devices/system/cpu/cpuN/cpuidle/state1/time (unit is microseconds)
* Workload variation is related to nap residency of sibling threads on that core
If the nap residency (time spent in nap) is not uniform then workload
performance would be proportionally non uniform.
The above statement (if proven) is one possible root-cause, that can
help us move forward and design a fix.
--Vaidy
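A rough sketch of how the residency check above could be scripted. It assumes the hardware threads of a core are numbered consecutively starting at the primary thread (for example cpu8 to cpu15 in SMT8); the function name core_residency and its optional arguments are hypothetical.

```shell
# core_residency FIRST_CPU [STATE] [SYSFS_ROOT] [THREADS]
# Print the cumulative residency (microseconds) in idle state STATE for each
# hardware thread of one core, starting at FIRST_CPU.
# Defaults: STATE=1 (nap), real sysfs root, 8 threads per core (SMT8).
core_residency() {
    local cpu="$1"
    local state="${2:-1}"
    local root="${3:-/sys/devices/system/cpu}"
    local threads="${4:-8}"
    local end=$((cpu + threads))
    local f
    while [ "$cpu" -lt "$end" ]; do
        f="$root/cpu$cpu/cpuidle/state$state/time"
        if [ -r "$f" ]; then
            printf 'cpu%d %s\n' "$cpu" "$(cat "$f")"
        fi
        cpu=$((cpu + 1))
    done
}
```

Since the counters are cumulative, sampling once before and once after a run and subtracting the two values per thread gives the time spent in nap during that run.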
== Comment: #37 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-13 17:45:33 ==
Hi Vaidy,
Let's use Bob's serial_loop.c as an example. There are 24 copies of
his program running on 24 cores in parallel. Only the primary threads
of the cores are used.
Did Shilpa use Bob's program to re-create the problem and find out
that some unused sibling threads do not sleep fast enough and take
away cycles from the primary thread to cause variability?
It is great to know that we can study the sleep time by examining the
/sys/devices/system/cpu/cpuN/cpuidle/state1/time. Did Shilpa use this
method to come up with the above understanding?
Based on George's finding, do you know whether there are thermal code
changes in the old firmware that affect the thermal behavior in the
current version?
Thanks,
Peter
== Comment: #38 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-05-13 23:24:18 ==
Is this really related to snooze? Jenifer mentioned in Comment 10 that disabling nap and not snooze also reduced the variance. Can you please confirm if this is the case? This will help us narrow down the issue.
Regards
Preeti U Murthy
== Comment: #39 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-14 10:19:09 ==
(In reply to comment #38)
Hi Preeti, sorry I corrected myself in comment 11, I was disabling state0 which is snooze, not nap:
# cpupower idle-set -d 0
# cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
snooze
Still might be interesting to try some tests w/ nap disabled.
== Comment: #40 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-14 11:15:45 ==
(In reply to comment #37)
Yes. I also used perf-trace events to get the same info.
Regards,
Shilpa
== Comment: #42 - Anton Blanchard <antonb@xxxxxxxxxxx> - 2015-05-19 19:40:45 ==
If I am reading that trace right, we spent over 200ms in snooze on a secondary thread of a badly performing core. That is an enormous amount of time to be chewing up the core.
== Comment: #43 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-19 21:45:20 ==
Vaidy,
Could you provide more information on your proposed solution which is
in the kernel, not in OPAL?
Does it mean that you need to upstream different patches to separate sets of
kernels for Ubuntu and the other distro?
Peter
== Comment: #44 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 10:56:48 ==
(In reply to comment #42)
Hi Anton,
That is right, exit from snooze state is the problem. In the proposed
fix Shilpa has added a forced exit from snooze loop after the target
residency so that the cpuidle governor can select nap.
We have to rewrite the snooze loop and exit after the first interrupt
or timer or after target residency (100us) so that the idle
state promotion can happen.
--Vaidy
== Comment: #45 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-20 11:02:06 ==
Hi,
I am sharing the link for ubuntu kernel packages with the fix:
1) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-14-04.tar
This file contains the following packages:
a)linux-headers-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
b)linux-image-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
c)linux-image-extra-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
d)linux-tools-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
The fix is based on top of ubuntu-14.04.02 3.16.0-38-generic + upstream commit (92c83ff5b42b cpuidle: powernv: Read target_residency value of idle states from DT if available)
2) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-15.04.tar
This file contains the following packages:
linux-headers-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
linux-image-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
linux-image-extra-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
linux-tools-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
The fix is based on top of ubuntu-15.04 3.19.0-17-generic
== Comment: #46 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 11:21:07 ==
(In reply to comment #43)
Hi Peter,
Sure. As per our discussion yesterday, we agreed on the following:
* The issue is not machine specific, the problem was recreated by
Jenifer on S822L also even though other teams believe the issue is
S824L specific.
* The key issue observed is the sibling thread's snooze time variation
which chews cycles from primary thread.
* The fix is to force exit snooze loop after target residency (100us)
and allow the cpuidle governor to enter nap.
* This fix is completely in Linux kernel cpuidle driver code and does
not require change in OPAL.
Yes, once we verify the solution, we will design the correct idle
state auto-promotion logic in cpuidle driver and get it upstream and
then push to the other distro and ubuntu distros that run bare-metal.
--Vaidy
== Comment: #47 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-20 12:44:17 ==
I tested Shilpa's kernel packages w/ the fix and can confirm I no longer see the variation issue w/ the serial loop program running on primary threads in SMT8 mode when the performance governor is set. I will get with Peter to test with another benchmark that previously hit the variation issue.
----
System:
8247-42L
20 cores, SMT8
FW830_041
Ubuntu 15.04
Run script:
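#!/bin/bash
# Run serial_loop 100 times, one copy pinned to the primary thread of each core.
for iter in $(seq 1 100)
do
    for cpu in 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152
    do
        taskset -c ${cpu} ./serial_loop > out.${cpu}.${iter} &
    done
    echo $iter
    # Wait for all copies of this iteration before starting the next one.
    wait
done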
Results:
-- 3.19.0-17 fix --
Performance
-----------
Loop elapsed:       User time:
Min      Max        Min      Max
3.885    3.92       3.877    3.914
3.885    3.892      3.877    3.886
3.885    3.908      3.877    3.901
Ondemand
--------
Loop elapsed:       User time:
Min      Max        Min      Max
3.933    3.949      3.901    3.912
-- orig 3.19.0-16 kernel --
Performance
-----------
Loop elapsed:       User time:
Min      Max        Min      Max
3.886    4.507      3.88     4.498
3.884    10.404     3.877    10.39
Ondemand
--------
Loop elapsed:       User time:
Min      Max        Min      Max
3.932    3.994      3.901    3.959
== Comment: #49 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-21 18:59:33 ==
The fix from comment #45 also resolves large variance issues w/ STREAM and DGEMM workloads. Results listed below.
=========================================
STREAM:
MB/sec
SMT8, 1 thread per core, 100 loop
-------- orig 3.19.0-16 kernel --------
Performance:
____________
Min Max %diff
run1: 304384.6341 308199.3341 1.25%
run2: 150096.0562 308516.5557 69.09%
Performance
+ disable snooze:
_________________
Min Max %diff
run1: 305700.3257 308403.9185 0.88%
run2: 305547.2215 308771.2772 1.05%
Ondemand:
_________
Min Max %diff
run1: 298386.1295 302209.7456 1.27%
----------- 3.19.0-17 fix -----------
Performance:
____________
Min Max %diff
run1: 303486.8368 308433.0545 1.62%
run2: 304768.6159 308410.2177 1.19%
run3: 304723.2556 308847.065 1.34%
Ondemand:
_________
Min Max %diff
run1: 297364.385 302473.0888 1.70%
=========================================
=========================================
DGEMM:
GFlops
SMT8, 1 thread per core, 20 loop
-------- orig 3.19.0-16 kernel --------
Performance:
____________
Min Max %diff
run1: 479.53 520.2 8.14%
Performance
+ disable snooze:
_________________
Min Max %diff
run1: 511.18 520.49 1.80%
Ondemand:
_________
Min Max %diff
run1: 505.64 509.88 0.84%
----------- 3.19.0-17 fix -----------
Performance:
____________
Min Max %diff
run1: 512.77 520.84 1.56%
run2: 517.19 520.34 0.61%
run3: 517.93 520.35 0.47%
Ondemand:
_________
Min Max %diff
run1: 505.72 508.53 0.55%
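The %diff column is not defined in the comment. The figures are consistent with the difference between max and min taken relative to their mean, so the sketch below uses that formula as an inference from the numbers, not as a confirmed definition. The function name pct_diff is hypothetical.

```shell
# pct_diff MIN MAX
# Percent difference relative to the mean of the two values:
#   (max - min) / ((max + min) / 2) * 100
# The formula is inferred from the tables above, not stated in the source.
pct_diff() {
    awk -v min="$1" -v max="$2" \
        'BEGIN { printf "%.2f\n", (max - min) / ((max + min) / 2) * 100 }'
}
```

For instance, pct_diff 150096.0562 308516.5557 gives 69.09, matching STREAM run2 on the original kernel, and pct_diff 479.53 520.2 gives 8.14, matching DGEMM run1.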
== Comment: #51 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-06-14 22:53:05 ==
Vaidy, is this fix being reviewed by the Linux kernel community? Can you give some estimates as to when this kernel fix will get into mainline and also when it will get into Ubuntu distro?
== Comment: #52 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-06-24 07:18:28 ==
The patch can be found in the upstream kernel 4.2
78eaa10f027c cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state