← Back to team overview

kernel-packages team mailing list archive

[Bug 1470404] [NEW] Some workloads experience more measurement variation with scaling_governor=performance than ondemand

 

You have been subscribed to a public bug:

== Comment: #0 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-15 21:30:31 ==
---Problem Description---
Many workloads experience wide measurement variation, more with scaling_governor=performance than ondemand. 
 
Contact Information = wpeter@xxxxxxxxxx, farid@xxxxxxxxxx 
 
---uname output---
Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = 20-core and 24-core Tuleta systems 
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
Set the system with the SMT8 mode and scaling_governor=performance or ondemand.
Run the workload 100 times.
Get 100 data points and sort them.
Compare the spread of results with two governor modes.
The source and scripts to run a simple test case will be provided.
 
Stack trace output:
 no
 
Oops output:
 no
 
Userspace tool common name: not sure what it is. 

Userspace rpm: ?? 
 
The userspace tool has the following bit modes: These are 64-bit programs. 
 
System Dump Info:
  The system is not configured to capture a system dump.

Userspace tool obtained from project website:  na 
 
*Additional Instructions for wpeter@xxxxxxxxxx, farid@xxxxxxxxxx: 
-Attach sysctl -a output output to the bug.
-Attach ltrace and strace of userspace application.

== Comment: #2 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:47:41 ==
This problem has a number of variables we were trying to reduce:
- endianness
- operating system
- kernel level
- compiler

Bob Walkup says he's seen the variability in a bunch of CPU-intensive
test cases, in various languages, using various compilers, which would
seem to eliminate the "compiler" variable.

We had not looked at the performance governor setting to this point.
Interesting results, and yet another variable to add to the above mix.
Perhaps two more runs?  (LE-ondemand, LE-performance, BE-ondemand, BE-
performance)

== Comment: #3 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-16 08:50:09 ==
Also, Bob says he can reproduce this with and without vectorization (the stalls move from the VSU to the FPU), and with and without floating point (the stalls move from the FPU to the FXU).  Very odd.

== Comment: #4 - Andrea M. Davis <amdavis@xxxxxxxxxx> - 2015-04-16 10:10:01 ==
Peter, what version of Ubuntu are you running?

== Comment: #5 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:32:58 ==
Andrea,

Ubuntu 14.04.2 LTS.

#uname -a
Linux c656f7n04 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:42:36 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

#lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.2 LTS
Release:	14.04
Codename:	trusty

== Comment: #6 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 10:50:11 ==
There are a few more things we have tried.

(1) For STREAM, it was originally compiled with gfotran and its
corresponding OpenMP. I compiled it with xlf and its corresponding
OpenMP. There is no difference in performance.

(2) There was a concern about NUMA, meaning is it possible the CPU
binding by OpenMP is incorrect so that there are remote memory accesses
behind the scene? By disabling one DCM and using 10 or 12 cores only in
the other DCM, we can still see occasional drops in performance,
although not often. We can conclude it is not due to NUMA.

(3) Farid and I also tried out different scheduler parameters
(sched_min_granularity_ns, sched_wakeup_granularity_ns,
sched_latency_ns, and others) and matched the correponding the other
distro's values, but did not see performance changes.

(4) For the workload AMG2006, the use of scaling_performance=ondemand
also reduces the spread of data significantly.

(5) For all the above investigations, I used a 20-core Tuleta and a
24-core Tuleta, although they are configured identically with Ubuntu
14.04.2. I mean two systems paint a consistent picture.

So far, we looked at compiler, NUMA, scheduler, memory test, CPU test,
ST vs SMT, etc. There is a significant difference in variation between
scaling_governor=performance and scaling_govenor=ondemand with the same
application and system configurations.

Hopefully, the data point us to the right direction, i.e., there could
be some unexpected behaviour with the implementation of
scaling_governor=performance.

== Comment: #7 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-16 14:30:21 ==
Note that Bob Walkup does not see the improvement using scaling_governor=ondemand on a borrowed POK lab system. However, he still suggested me to open a bug based on my findings. I guess he is not totally sure about the system he got.

It would be good to have data independently collected by others to
verify my observations.

Bob's serial_loop.c program can be compiled and run very easily. The
examination of data is straightforward too.

== Comment: #10 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-17 16:33:38 ==
I was able to reproduce the problem with the serial_loop test described in comment 1 (my system is Ubunu 15.04), however disabling the nap cpuidle state seemed to resolve the variance:

cpupower idle-set -d 0

Can others reproduce?   I am not sure why nap behavior would be any
different w/ the performance governor though..   Note, to re-enable:
cpupower idle-set -E

== Comment: #11 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 13:09:34 ==
(In reply to comment #10)
> disabling the nap cpuidle
> state seemed to resolve the variance:
> 
> cpupower idle-set -d 0

just want to clarify state0 is actually snooze, not nap:
# cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
snooze
# cat /sys/devices/system/cpu/cpu0/cpuidle/state1/name
Nap

== Comment: #12 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-20 16:26:32 ==
Jenifer, thanks for the suggestion.

"cpupower idle-set -d 0" works for Bob's serial_loop.c program.

There are 24 identical processes running serial_loop in parallel, each
bound to one core. With 100 iterations, there are 2400 elapsed times
collected for each run. Each elapsed time over 5 seconds is counted as
an outlier.

The following data were collected on a 24-core Tuleta system.

Scaling_govenor = P(erformance) or O(ndemand)
snooze (state0) = default (enabled) and disabled

P and default                = 34-35 outliers
P and snooze disabled = 0 outliers

O and default               = 2-4 outliers
O and snooze disabled = 0 outliers

As you asked, why do we need to disable snooze in order to reduce
measurement variation when scaling_governor=performance?

== Comment: #13 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-20 16:40:46 ==
Vaidy,  could your team comment on this?  In SMT8 mode, more measurement variation is seen using the performance governor compared to the ondemand governor when snooze is enabled, but disabling snooze seems to resolve the problem. Does it make sense that snooze impacts would be higher in performance mode?  

Stewart mentioned some latency improvements in the new 830 OPAL
firmware, is that related to this type of sleep state wakeup?


== Comment: #14 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 12:23:01 ==
"cpupower idle-set -d 0" also fixes the measurement variation of STREAM on a 24-core Tuleta system.

scaling_governor=performance and default snooze = 65 outliers out of 400
runs.

scaling_governor=performance and snooze disabled = 0 outlier out of 400
runs.

== Comment: #15 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-21 23:21:22 ==
"cpupower idle-set -d 0" also fixes the measurement variation of AMG2006 on a 24-core Tuleta system.

It means when scaling_governor=performance, disabling snooze (state0,
shallow sleep) while still enabling Nap (state1, deep sleep) can
stabilize measurements.

Vaidy,  please help understand this behaviour.


== Comment: #17 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 14:22:11 ==
Hi Team,

Interesting observation.  Let me give possible contributing factors:

(a) When running on ondemand, cpu frequency changed from min to max including turbo frequencies.
(b) When running performance governor, frequency is set to constantly run turbo.

Based on temperature, CPU may not be able to sustain turbo since we are
constantly running at the frequency and burning more power.  The
variation could actually come from the fact that we the platform (OCC)
could drop the frequency periodically due to over temperature.

While running ondemand, turning down the power could help sustain the turbo frequency longer.
Disabling snooze will further increase the power consumption and push for more variation at turbo frequency.

Our systems are designed to run consistently at nominal frequency and
hence I would suggest that you run your experiment by setting nominal
frequency to all cores using performance governor+max limit or userspace
governor.

You could use "Throughput-performance profile" using tuned-adm for this
purpose.

If running in "Nominal" Frequency gives you consistent performance, then
the above theory of turbo mode variation holds good.  We can confirm
them with additional traces in cpufreq back-end driver code.  We are
currently improving our instrumentation to detect frequency variation
and throttling.  This is a good scenario to validate our trace design as
well.

Let me know what you find.

--Vaidy

== Comment: #18 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-04-22 14:28:15 ==
(In reply to comment #17)

> Disabling snooze will further increase the power consumption and push for
> more variation at turbo frequency.

We actually see the opposite effect, disabling snooze makes the
variability at turbo freq go away :)

== Comment: #19 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 14:44:43 ==
Additionally, this is not a problem when running BE kernel, on the same P8 configuration box. I suspect
it is more to do with configuration settings on LE before we start pointing finger at the FW codepath
when using Ubuntu LE.

== Comment: #20 - Paul A. Clarke <pacman@xxxxxxxxxx> - 2015-04-22 15:23:43 ==
Bob is finding another distro LE does _not_ exhibit variation.

This would seem to eliminate LE as the culprit.

Looking at the settings of
/sys/devices/system/cpu/cpu*/cpuidle/state0/disable, they all report
"0", which I believe is the same as having "snooze" enabled, correct?
That would seem to eliminate "snooze" in and of itself as a culprit, *at
least with this kernel level (3.10.0-210.ael7a)*.

I'm starting to suspect it's an issue with the kernel in Ubuntu
(3.16...)

== Comment: #21 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> -
2015-04-22 15:31:41 ==

Running at constant nominal frequency will help you eliminate turbo mode
variation and focus on the Linux issues and root-cause faster.

The behavior I described above is not a bug or problem in firmware.  It
is the expected and correct behavior where throttling can happen.  I am
only trying to help you to reduce the number of variables that is
affecting this experiment.

--Vaidy

== Comment: #22 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-04-22 15:35:45 ==
(In reply to comment #20)

This is good input.  The other distro does not have fast-sleep support.
We will have only snooze and nap.  On the Ubuntu system do you see
/sys/devices/system/cpu/cpu*/cpuidle/state2/name ?

Disabling fast-sleep state if present in your Ubuntu setup could help us
to the next step.

--Vaidy

== Comment: #23 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:30:28 ==
On the different distro LE system provided by Paul Clarke, the observed behavior is different than what I have seen on Ubuntu LE systems, but one of the tests ... the MPI-enabled simple loop ... shows huge timing variations core-to-core for nearly every job.  That system has 24 cores in smt8 mode 

ppc64_cpu --frequency
Power Savings Mode: Dynamic, Favor Performance
min:    3.961 GHz (cpu 175)
max:    3.963 GHz (cpu 1)
avg:    3.962 GHz

and nearly every job provides output that looks like this :
out.10:tmin = 3.757, tmax = 6.519 on rank 17, tavg = 5.126

meaning that it takes anywhere from 3.757 to 6.519 seconds to get
through the timed loop :

   MPI_Barrier(MPI_COMM_WORLD);
   t1 = MPI_Wtime();
   sum = 0.0;
   for (i=0; i<2000000000; i++) sum += ((double) (i%10)); 
   t2 = MPI_Wtime();
   elapsed = t2 - t1;

There are no loads or stores in that loop ... there is a separate
process bound to each core, and they work independently.  Additional
instrumentation shows that the slow processes are in the run queue the
whole time.

So far, the other work loads that I have tried on the different distro
LE system showed significantly lower timing variations than what I had
recorder on Ubuntu LE ... but not this one.

== Comment: #24 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 16:54:07 ==
Just adding that on the same different distro LE system, after turning off SMT via the command : ppc64_cpu --smt=1, all instances of the simple loop test have outputs like this :

tmin = 3.756, tmax = 3.757 on rank 5, tavg = 3.757

in other words it takes the same time to complete the work in the loop
on every core ... every time,  within the limits of what I have had the
patience to check.

== Comment: #25 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:03:16 ==
Bob, the use of ST mode reduces variation on Ubuntu 14.04.2 as well.

With SMT8 on another distro LE, I wonder whether "cpupower idle-set -d
0" helps reduce variation for the MPI-enabled simple loop?

Is it correct to say that both Ubuntu LE 14.04.2 (kernel 3.16.0) and
another distro LE (kernel) exhibit variation?

Vaidy, Ubuntu 14.4.2 does not have cpuidle/state2 (fastsleep state).

== Comment: #26 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-22 17:11:42 ==
I ran the command :

[root@tuleta ~]# cpupower idle-set -d 0
Idlestate 0 disabled on CPU 0
Idlestate 0 disabled on CPU 1
...

on the different distro LE system after setting the state back to smt8,
and the timing variability is still there :

out.2:tmin = 3.757, tmax = 9.010 on rank 4, tavg = 4.619
out.3:tmin = 3.757, tmax = 11.518 on rank 2, tavg = 4.684
out.4:tmin = 3.757, tmax = 9.398 on rank 3, tavg = 4.773

Essentially every job is showing truly huge timing variations.

== Comment: #27 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-22 17:24:46 ==
Does it make any difference with "cpupower idle-set -d 1"? to disable Nap too?

I think we only have snooze and Nap on LE.

== Comment: #28 - Basu Vaidyanathan <basu@xxxxxxxxxx> - 2015-04-22 17:46:14 ==
(In reply to comment #27)

I have a p8 box running ubuntu 14.10 and I do see 
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/name
FastSleep

== Comment: #29 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-04-23 06:01:57 ==
I see that there are hotplug operations being carried out simultaneously with running the benchmark. If so, the performance degradation could be due to the tasks being not allowed to run on the freshly onlined cpus. 

I would suggest boot a system with all hardware threads and not do
hotplug operations in order to keep the above issue away while verifying
the performance of the benchmarks, if the intention is to profile the
cpufreq governors.

Regards
Preeti U Murthy

== Comment: #31 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-28 00:27:52 ==
On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.

Are the enabling and disabling of these two states independent?

== Comment: #32 - Robert E. Walkup <walkup@xxxxxxxxxx> - 2015-04-28 16:16:23 ==
Adding an observation on ubuntu le systems, using the simple-loop example above and the userspace governor (chosen so that one can set the frequency to a desired value).  When  using one thread per core with the system in SMT8 state, the time for the loop varies from ~3.7 sec to over 8 sec.  However, if a lot of iterations (10-20) of the same loop are done before starting the timed section of the code (adding a warmup loop), the variations in the timed section are dramatically reduced.  There are still some outliers, but a much smaller number of them; and the timing spread is a fraction of one second, instead of several seconds.  So there is a clear dependence on history, with the largest timing variations occurring immediately after job startup.  I should mention that this remains a problem for many performance benchmarks in the HPC area, which often run in a total time of less than one minute.  I would hope that with the userspace governor, or the performance governor, the power and frequency settings would remain constant.  Can someone confirm that?

== Comment: #33 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-04-29 17:16:58 ==
Vaidy, would you help answer my question on Comment 31?

== Comment: #34 - George A. Chochia <chochia@xxxxxxxxxx> - 2015-05-13 11:52:53 ==
Vaidy, I am currently seeing a 2.5x performance degradation in the Message Rate benchmark on p8, Ubuntu 14.04.02 LE.

Performance was normal back in February, when we had 14.04.01 and older
FW.

The degradation goes away once snooze state is disabled. There have been
two FW updates: 1/13 and 2/17.

== Comment: #35 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:35:37 ==
(In reply to comment #31)
> On Ubuntu 14.04.2, there are two states in cpuidle: snooze and Nap.
> 
> Are the enabling and disabling of these two states independent?

Hi Peter,

Yes the enable/disable for idle states are independent.  Atleast 1 idle
state is expected to be enabled, if not the CPU may busy loop at idle
and not reduce the thread priority like snooze.

You can disable snooze and have nap enabled or the other way, but having
both disabled will lead to idle threads burning more cycles.

--Vaidy

== Comment: #36 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-13 14:58:07 ==
(In reply to comment #34)

Hi George,

The idle state management code is same for both the kernels.  You have
only snooze and nap as idle states right?

As I explained over email, when snooze and nap are enabled, the cpuidle
logic should choose nap for idle sibling threads after a short period in
snooze.

Can you guys analyse and confirm that following points:

* Workloads is run on primary thread on each core always
* Remaining 7 sibling threads should be in nap (state1)
* Time spend in 'nap' state for each of the sibling threads can be obtained from sysfs
/sys/devices/system/cpu/cpuN/cpuidle/state1/time (unit is micro secs)
* Workload variation is related to nap residency of sibling threads on that core

If the nap residency (time spent in nap) is not uniform then workload
performance would be proportionally non uniform.

The above statement (if proven) is one possible root-cause, that can
help us move forward and design a fix.

--Vaidy

== Comment: #37 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-13 17:45:33 ==
Hi Vaidy,

Let's use Bob's serial_loop.c as an example. There are 24 copies of his
program running on 24 cores in parallel. Only the primary threads of the
cores are used.

Did Shilpa use Bob's program to re-create the problem and find out that
some unused sibling threads do not sleep fast enough and take away
cycles from the primary thread to cause variability?

It is great to know that we can study the sleep time by examining the
/sys/devices/system/cpu/cpuN/cpuidle/state1/time. Did Shilpa use this
method to come up with the above understanding?

Based on George's finding, do you know whether there are thermal code
changes in the old firmware that affects the thermal behavior in the
current version?

Thanks,
Peter

== Comment: #38 - Preeti U. Murthy <preeti.murthy@xxxxxxxxxx> - 2015-05-13 23:24:18 ==
Is this really related to snooze ? Jennifer mentioned in Comment 10 that disabling nap and not snooze also reduced the variance ? Can you please confirm if this is the case ? This will help us narrow down on the issue.

Regards
Preeti U Murthy

== Comment: #39 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-14 10:19:09 ==
(In reply to comment #38)
Hi Preeti, sorry I corrected myself in comment 11, I was disabling state0 which is snooze, not nap:
# cpupower idle-set -d 0
# cat /sys/devices/system/cpu/cpu0/cpuidle/state0/name
snooze

Still might be interesting to try some tests w/ nap disabled.

== Comment: #40 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-14 11:15:45 ==
(In reply to comment #37)
Yes . I also used perf-trace events to get the same info.

Regards,
Shilpa

== Comment: #42 - Anton Blanchard <antonb@xxxxxxxxxxx> - 2015-05-19 19:40:45 ==
If I am reading that trace right, we spent over 200ms in snooze on a secondary thread of a badly performing core. That is an enormous amount of time to be chewing up the core.

== Comment: #43 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-05-19 21:45:20 ==
Vaidy,

Could you provide more information on your proposed solution which is in
the kernel, not in OPAL?

Does it mean that you need to upstream different patches to set of
kernels for Ubuntu and other distro?

Peter

== Comment: #44 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 10:56:48 ==
(In reply to comment #42)
Hi Anton,

That is right, exit from snooze state is the problem.  In the proposed
fix Shilpa has added a forced exit from snooze loop after the target
residency so that the cpuidle governor can select nap.

We have to rewrite the snooze loop and exit after the first interrupt or
timer or after after target residency (100us) so that the idle state
promotion can happen.

--Vaidy

== Comment: #45 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-05-20 11:02:06 ==
 Hi,

I am sharing the link for ubuntu kernel packages with the fix:

1) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-14-04.tar
    This file contains the following packages:
    a)linux-headers-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
    b)linux-image-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
    c)linux-image-extra-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
    d)linux-tools-3.16.0-38-generic_3.16.0-38.52~14.04.1_ppc64el.deb
    The fix is based on top of ubuntu-14.-04.02 3.16.0-38-generic + upstream commit (92c83ff5b42b  cpuidle: powernv: Read target_residency value of idle states from DT if available)

2) http://kernel.stglabs.ibm.com/~shilpa/ubuntu-15.04.tar
    This file contains the following packages:
    linux-headers-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
    linux-image-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
    linux-image-extra-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
    linux-tools-3.19.0-17-generic_3.19.0-17.17+snooze_ppc64el.deb
    The fix is based on top of ubuntu-15.04 3.19.0-17-generic

== Comment: #46 - VAIDYANATHAN SRINIVASAN <svaidyan@xxxxxxxxxx> - 2015-05-20 11:21:07 ==
(In reply to comment #43)

Hi Peter,

Sure.  As per our discussion yesterday, we agreed on the following:

* The issue is not machine specific, the problem was recreated by
Jenifer on S822L also even though other teams believe the issue is S824L
specific.

* The key issue observed is the sibling thread's snooze time variation
which chews cycles from primary thread.

* The fix is to force exit snooze loop after target residency (100us)
and allow the cpuidle governor to enter nap.

* This fix is completely in Linux kernel cpuidle driver code and does
not require change in OPAL.

Yes, once we verify the solution, we will design the correct idle state
auto-promotion logic in cpuidle driver and get it upstream and then push
to the other distro and ubuntu distros that run bare-metal.

--Vaidy

== Comment: #47 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-20 12:44:17 ==
I tested Shilpa's kernel packages w/ the fix and can confirm I no longer see the variation issue w/ the serial loop program running on primary threads in SMT8 mode when the performance governor is set.   I will get with Peter to test with another benchmark that previously hit the variation issue.

----

System:
8247-42L
20 cores, SMT8
FW830_041 
Ubuntu 15.04

Run script:
#!/bin/bash

for iter in `seq 1 100`
do
  for cpu in 0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136 144 152
  do
  taskset -c ${cpu} ./serial_loop > out.${cpu}.${iter} &
  done
  echo $iter
  wait
done

Results:

-- 3.19.0-17 fix -- 	
					
Performance
-----------
Loop elapsed:		User time:
Min	Max		Min	Max
3.885	3.92		3.877	3.914
3.885	3.892		3.877	3.886
3.885	3.908		3.877	3.901			

Ondemand
--------
Loop elapsed:		User time:
Min	Max		Min	Max
3.933	3.949		3.901	3.912


-- orig 3.19.0-16 kernel --
					
Performance
-----------
Loop elapsed:		User time:
Min	Max		Min	Max
3.886	4.507		3.88	4.498
3.884	10.404		3.877	10.39

Ondemand
--------
Loop elapsed:		User time:
Min	Max		Min	Max
3.932	3.994		3.901	3.959

== Comment: #49 - JENIFER HOPPER <jhopper@xxxxxxxxxx> - 2015-05-21 18:59:33 ==
The fix from comment #45 also resolves large variance issues w/ STREAM and DGEMM workloads. Results listed below.

=========================================
STREAM:

MB/sec				
SMT8, 1 thread per core, 100 loop
				
-------- orig 3.19.0-16 kernel --------
	
Performance:
____________
	Min		Max		%diff
run1:	304384.6341	308199.3341	1.25%
run2: 	150096.0562	308516.5557	69.09%

Performance 
+ disable snooze:
_________________
	Min		Max		%diff
run1:	305700.3257	308403.9185	0.88%
run2: 	305547.2215	308771.2772	1.05%

Ondemand:
_________
	Min		Max		%diff
run1:	298386.1295	302209.7456	1.27%


----------- 3.19.0-17 fix -----------
	
Performance:
____________
	Min		Max		%diff
run1:	303486.8368	308433.0545	1.62%
run2: 	304768.6159	308410.2177	1.19%
run3:	304723.2556	308847.065	1.34%

Ondemand:
_________
	Min		Max		%diff
run1:	297364.385	302473.0888	1.70%

=========================================

=========================================
DGEMM:

GFlops				
SMT8, 1 thread per core, 20 loop
				
-------- orig 3.19.0-16 kernel --------
	
Performance:
____________
	Min		Max		%diff
run1:	479.53		520.2		8.14%

Performance 
+ disable snooze:
_________________
	Min		Max		%diff
run1:	511.18		520.49		1.80%

Ondemand:
_________
	Min		Max		%diff
run1:	505.64		509.88		0.84%


----------- 3.19.0-17 fix -----------
	
Performance:
____________
	Min		Max		%diff
run1:	512.77		520.84		1.56%
run2: 	517.19		520.34		0.61%
run3:	517.93		520.35		0.47%

Ondemand:
_________
	Min		Max		%diff
run1:	505.72		508.53		0.55%

== Comment: #51 - Peter W. Wong <wpeter@xxxxxxxxxx> - 2015-06-14 22:53:05 ==
Vaidy, is this fix being reviewed by the Linux kernel community? Can you give some estimates as to when this kernel fix will get into mainline and also when it will get into Ubuntu distro?

== Comment: #52 - Shilpasri G. Bhat <shigbhat@xxxxxxxxxx> - 2015-06-24 07:18:28 ==
The patch can be found in the upstream kernel 4.2
78eaa10f027c cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New


** Tags: architecture-ppc64le bot-comment bugnameltc-124023 severity-high targetmilestone-inin14043
-- 
Some workloads experience more measurement variation with scaling_governor=performance than ondemand
https://bugs.launchpad.net/bugs/1470404
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.