ac100 team mailing list archive
Re: Stability Under Load
On Fri, 19 Aug 2011 16:35:47 +0200, Julian Andres Klode wrote:
On Fri, Aug 19, 2011 at 10:18:31AM +0100, Gordan Bobic wrote:
As some of you may have already heard on the IRC channel, I had my
AC100 suddenly become very unstable under load. When doing big
compile jobs, the compiler would relatively regularly segfault,
detect hardware errors, or hit errors it didn't think were
hardware-related and invite me to file a bug report with the
pre-processed C file. None of
these were reproducible (it would error out in a different place on
different runs). So I figured I had duff hardware and got another
one. This is a lot better, but I still get spurious, unreproducible
errors like this every few hours (the old one would error out up to a
few times per hour when it was being hammered with compile jobs for a
few hours). Both of mine are 10U models with Micron RAM.
Now, either I am incredibly unlucky or something else is going on.
What I would like to know is:
1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc build)?
2) If 1), are you seeing random errors like what I'm describing?
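For anyone wanting to answer 2) on their own machine, a loop along these lines makes the failure rate measurable: rerun the same build repeatedly and count how many runs die. This is a minimal POSIX-sh sketch; `retry_build` and its arguments are placeholders for whatever compile job you actually run.

```shell
# retry_build CMD N: run CMD up to N times, logging each run, and
# report how many runs failed. Intermittent compiler segfaults show
# up as a non-zero failure count.
retry_build() {
    cmd="$1"
    runs="$2"
    fails=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Log each run separately so the failing runs can be compared.
        if ! $cmd > "run-$i.log" 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails of $runs runs failed"
}
```

Something like `retry_build "make -j2 zImage" 10` would do (the make target is an assumption; substitute your own build).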
Yes, I am experiencing crashing and endlessly recursing GCCs when
trying to build kernels on my 10V, on Debian armel, at least when
compiling with multiple cores. With one core, I got a complete
build (although that takes 4 hours), at least after having
the machine off for a day and then booting and starting to build.
I also built a kernel in an armhf environment using 3-5 parallel
jobs without a problem (2-hour build time).
So you are suggesting that the hf platform is actually more stable? Are
your results repeatable, in terms of demonstrating that on hf it doesn't fail?
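One way to pin that down would be to rerun the same build at several -j levels and see whether only the parallel runs fail. A rough sh sketch; the build wrapper passed in is hypothetical (it should run something like `make -j"$1"` and exit non-zero on failure):

```shell
# try_jobs BUILD J1 J2 ...: invoke BUILD once per job count and
# report pass/fail for each level, to see whether failures track
# parallelism rather than total work done.
try_jobs() {
    build="$1"
    shift
    for jobs in "$@"; do
        if $build "$jobs" >/dev/null 2>&1; then
            echo "-j$jobs: ok"
        else
            echo "-j$jobs: failed"
        fi
    done
}
```

E.g. `try_jobs ./kernel_build.sh 1 2 4` (the wrapper script name is made up).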
Furthermore, while building a kernel I tried to decompress a
file; this failed on the first attempt, but succeeded on the second.
Yes, definitely seen that here, too.
I ran some memory testing tools, but they did not find any errors.
Ditto, I ran it many, many times and it hasn't found any issues, but it is
generally not good for stress-testing.
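Since a plain memory tester alone doesn't load the chip the way a big compile does, it may be worth running both at once. A hedged sh sketch; the memtester invocation and build command in the usage line are examples, not known-good settings for this machine:

```shell
# stress_pair MEM_CMD LOAD_CMD: run a memory soak and a compile load
# concurrently and report both exit codes; load-dependent memory
# faults are more likely to show up under combined pressure.
stress_pair() {
    mem_cmd="$1"
    load_cmd="$2"
    $mem_cmd > memtester.log 2>&1 &
    mem_pid=$!
    $load_cmd
    load_rc=$?
    wait "$mem_pid"
    mem_rc=$?
    echo "load exit: $load_rc, memtester exit: $mem_rc"
}
```

Something like `stress_pair "memtester 256M 1" "make -j2"` (size and job count are guesses for a 512 MB machine).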
My questions here being:
(a) do you run a custom-built kernel?
I tried with the old 2.6.29 (IIRC from the old Ubuntu5 tarball last
year), with 220.127.116.11+ and with my own patched 18.104.22.168+ (22.214.171.124+ plus
patches to 126.96.36.199). The same happens on all of them. I haven't tried
(b) do you use the binary nvidia driver?
Yes, but I have observed the instability without the driver being
loaded (i.e. not starting xorg or, in the case of 2.6.29, nvrm_daemon),
so I don't think the binary driver is the issue here. It would be very
hard for the binary Xorg driver to cause other programs to randomly segfault.
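To make "driver not loaded" runs verifiable, a quick check against /proc/modules before each test run is enough. A small sketch; the module name "nvidia" is an assumption about how the binary driver registers itself:

```shell
# driver_loaded NAME [MODULES_FILE]: succeed if NAME appears as a
# loaded kernel module. The optional file argument exists only so
# this can be exercised off-target; it defaults to /proc/modules.
driver_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

E.g. `driver_loaded nvidia && echo "abort: driver is loaded"` before kicking off a build.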
If I recall correctly, the builds have only succeeded so far on systems
without binary drivers. But I could be wrong.
I have definitely observed instability regardless of the binary driver.
The obvious question I have now is that since there clearly are several
people who have seen stability issues, why hasn't this been raised
before? If we have hardware that is demonstrably marginal across model
variants (10U, 10V), how come nobody has kicked off about it before me?
If it turns out that the AC100 is systematically suffering from duff,
factory-overclocked hardware (as is fairly typical of nvidia: their
chips generally cannot handle running at full load at default clocks for
reasonable periods of time, and they have no margin for error at all,
in terms of either default voltages or clock speeds), it seems the effort
going into it may well be wasted, at least until other similar hardware
becomes available. I'm eagerly awaiting Jeremiah's report on whether his
TrimSlice is exhibiting the same issues. I sincerely hope it isn't, and
that it's down to memory timings, since at least we can try to do
something about those.