
ac100 team mailing list archive

Re: Stability Under Load

 

On 08/21/2011 01:57 PM, Julian Andres Klode wrote:
On Sun, Aug 21, 2011 at 01:09:09PM +0100, Gordan Bobic wrote:
I'm also curious how come my powertop is showing 1000MHz with no
errors in the log when I set SM1 to 975mV.
975mV might effectively end up as 1000mV; there is some rounding up involved,
and as far as I know, the voltage steps are 50mV.

Really? The error I was seeing specifically mentioned 975mV.

Harmony sets the minimum voltage to 750mV and the maximum to 1125mV;
maybe that gives more stability?

Interesting. It also occurs to me that, beyond just tweaking voltages
(which, again, would be much easier if they were run-time adjustable
via /sys, as I said in a previous post), it would be really handy to
get core temperature readings. Does the AC100 have temperature
sensors built in?
No. We don't know what would happen if we started exposing the various
settings somewhere, when they are read, etc. Too unsafe, in my
opinion.

Provided there are limit checks in place (e.g. hard-code a check
so that you can't set the voltage > 1250mV), I don't see what
harm could come of it, other than making stability stress testing
easier.
They are exposed in /sys/class/regulator/, but read-only.

Thanks for pointing that out. :)
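For the record, something like this should dump what the kernel exposes there (a quick sketch; the paths and attribute names assume the standard regulator sysfs interface):

  # List the regulators the kernel registered and their current voltages.
  # Attribute names assume the stock regulator sysfs layout; some may be
  # absent depending on what the driver exposes.
  for r in /sys/class/regulator/regulator.*; do
      echo "$(cat $r/name): $(cat $r/microvolts 2>/dev/null) uV" \
           "(min $(cat $r/min_microvolts 2>/dev/null)," \
           "max $(cat $r/max_microvolts 2>/dev/null))"
  done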

There's
also read-only /sys/kernel/debug/clock/dvfs for the current dvfs
state.

Hmm, it seems my /sys/kernel/debug subtree is empty. I'll have a poke through the kernel config and enable it next time I'm rebuilding a kernel.
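Then again, it may just be that debugfs isn't mounted rather than missing from the config; something along these lines should tell either way (a sketch, assuming the usual debugfs setup):

  # Is debugfs compiled in and mounted?
  grep debugfs /proc/filesystems               # "nodev debugfs" means the kernel supports it
  mount -t debugfs none /sys/kernel/debug      # mount it (as root) if it isn't already
  cat /sys/kernel/debug/clock/dvfs             # the dvfs state should then be readable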

The question is whether the kernel code is ready to have those values
changed at run-time, or whether it reads them at start-up and builds
other structures out of them. We don't know, which is why I said it
shouldn't be done.

Fair point. I hadn't considered that.

On an unrelated note, I noticed an interesting possible correlation
between an error in my message log and the instability I am currently
investigating. It is possible that I have been barking up completely
the wrong tree so far. I need to do some more investigating
(a _LOT_ of SLUB memory allocation failures, possibly to do with
zram swapping and/or the size of vmalloc set on the kernel command
line).
The SLUB errors usually come from rt2800usb; without the module loaded,
they should vanish. You could also try using SLAB instead of
SLUB.

Yes, I did notice that the rt* modules were in the error dump. I
don't remember seeing the option in the kernel config to choose SLAB
over SLUB. Where is it?
In init/Kconfig, aka "General setup", "Choose SLAB allocator". There
you can choose between SLAB, SLUB, and SLOB.

Got it, thanks.
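For anyone else following along, checking which allocator a kernel was built with boils down to something like this (assuming the config is available as /proc/config.gz or as a .config in the build tree):

  # Exactly one of CONFIG_SLAB / CONFIG_SLUB / CONFIG_SLOB should be =y
  zcat /proc/config.gz | grep -E 'CONFIG_SL[AUO]B='    # needs CONFIG_IKCONFIG_PROC
  grep -E 'CONFIG_SL[AUO]B=' .config                   # or check the build tree directly
  # To switch, select SLAB under "General setup" -> "Choose SLAB allocator"
  # in make menuconfig and rebuild.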

This stability problem is particularly frustrating because I saw the
errors occurring on 2.6.29, which didn't have zram, so in theory it
can't be directly zram-related (and I've been running zram on my
SheevaPlug on a 2.6.36.2 kernel for ages with much heavier loads).
I don't have zram either.

Indeed, but there could be some weird interaction going on. Do you have vmalloc= in your kernel boot parameters?
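That's easy enough to check on a running system; nothing paz00-specific, just the usual proc files:

  # Was vmalloc= passed on the kernel command line, and how big is the area?
  cat /proc/cmdline
  grep -i vmalloc /proc/meminfo       # VmallocTotal / VmallocUsed / VmallocChunk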

So I'm taking all my observations at the moment with a fist-sized
grain of salt. What is weird, however, is that I seem to be running
completely stable today with SM1 set to 975mV in board-paz00-power.c,
and it's warmer than it was yesterday.
You always need to remember that this value is just a maximum for the
regulator; it is not fixed. In your case, the regulator scales from
725mV to 975mV, in 50(?) mV steps.

Indeed, but with the performance governor set, it should never drop below the max.
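(For completeness, this is what I'm going by; standard cpufreq sysfs paths, so treat it as a sketch rather than anything AC100-specific:)

  # Confirm the governor and that the CPU is sitting at the top frequency
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
  # To force it explicitly (as root):
  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor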

The only other differences are (commands for 1–3 sketched below):
1) Disabled zram swap (still have normal swap)
2) Changed vm.swappiness from 100 to 0
3) Unloaded rt* and related modules
4) Rebuilding the kernel (with -j4) instead of glibc
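For reference, the first three boil down to roughly this (the zram device name and the exact rt* module list are from memory, so treat it as a sketch):

  # 1) drop the zram swap device (device name assumed; older setups use /dev/ramzswap0)
  swapoff /dev/zram0
  # 2) stop the VM from swapping proactively
  sysctl vm.swappiness=0              # or: echo 0 > /proc/sys/vm/swappiness
  # 3) unload the Ralink wireless modules
  rmmod rt2800usb rt2800lib rt2x00usb rt2x00lib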

The obvious difference with 4) is that glibc takes a lot more memory
to compile than the kernel, which causes swapping. If the kernel
compile finishes without errors, I'll try building glibc again. If
that shakes it loose, the only thing I can think of is the vmalloc
kernel boot parameter, which came from the original Android setup
(vmalloc=320M). I'm pretty sure this shouldn't be needed, but it is
vaguely plausible that it is causing issues under high memory
pressure, at least in combination with other things I have running.

It's most likely not memory related, but a bug in voltage
scaling. I currently have DVFS disabled, and the build seems
to be running without errors, whereas it would error out after
a few minutes with DVFS enabled.

Is this still relevant with the performance governor enabled, though? I was seeing errors with the performance governor, which I would expect to keep the voltages from ever changing.

Commit 1f8100366e46c626becd71a34cdcf7976570ea11 [1] for seaboard might
be interesting; it reduces the slew rate to keep the voltage from going
down too quickly.

[1] http://gitorious.org/~marvin24/ac100/marvin24s-kernel/commit/1f8100366e46c626becd71a34cdcf7976570ea11

Hmm, interesting. This is already in Marc's git tree, but it hasn't been ported to paz00. I'll give it a shot, but it still doesn't quite explain all the observed data points, including errors occurring while running with the performance governor.

Gordan

