ac100 team mailing list archive
Re: Stability Under Load
On Fri, 19 Aug 2011 16:35:47 +0200, Julian Andres Klode wrote:
On Fri, Aug 19, 2011 at 10:18:31AM +0100, Gordan Bobic wrote:
As some of you may have already heard on the IRC channel, I had my
AC100 suddenly become very unstable under load. When doing big
compile jobs, the compiler would relatively regularly segfault,
detect hardware errors, or hit errors it didn't think were
hardware-related and invite me to file a bug report with the
pre-processed C file. None of
these were reproducible (it would error out in a different place on
different runs). So I figured I had duff hardware and got another
one. This is a lot better, but I still get spurious, unreproducible
errors like this every few hours (the old one would error out up to a
few times per hour when it was being hammered with compile jobs for a
few hours). Both of mine are 10U models with Micron RAM.
Now, either I am incredibly unlucky or something else is going on.
What I would like to know is:
1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc build)?
2) If 1), are you seeing random errors like what I'm describing?
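For anyone wanting to answer 2) on their own machine, a loop along these lines makes the failure rate measurable: rerun the same build repeatedly and count how many runs die. This is a minimal POSIX-sh sketch; `retry_build` and its arguments are placeholders for whatever compile job you actually run.

```shell
# retry_build CMD N: run CMD up to N times, logging each run, and
# report how many runs failed. Intermittent compiler segfaults show
# up as a non-zero failure count.
retry_build() {
    cmd="$1"
    runs="$2"
    fails=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Log each run separately so the failing runs can be compared.
        if ! $cmd > "run-$i.log" 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails of $runs runs failed"
}
```

Something like `retry_build "make -j2 zImage" 10` would do (the make target is an assumption; substitute your own build).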
Yes, I am experiencing crashing and endlessly recursing GCCs when
trying to build kernels on my 10V, on Debian armel, at least when
compiling with multiple cores. With one core, I got a complete
build (although that takes 4 hours), at least after having
the machine off for a day and then booting and starting to build.
I also built a kernel in an armhf environment using 3-5 parallel
jobs without a problem (2-hour build time).
So you are suggesting that the hf platform is actually more stable? Are
your results repeatable, in terms of demonstrating that on hf it doesn't fail?
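One way to pin that down would be to rerun the same build at several -j levels and see whether only the parallel runs fail. A rough sh sketch; the build wrapper passed in is hypothetical (it should run something like `make -j"$1"` and exit non-zero on failure):

```shell
# try_jobs BUILD J1 J2 ...: invoke BUILD once per job count and
# report pass/fail for each level, to see whether failures track
# parallelism rather than total work done.
try_jobs() {
    build="$1"
    shift
    for jobs in "$@"; do
        if $build "$jobs" >/dev/null 2>&1; then
            echo "-j$jobs: ok"
        else
            echo "-j$jobs: failed"
        fi
    done
}
```

E.g. `try_jobs ./kernel_build.sh 1 2 4` (the wrapper script name is made up).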
Furthermore, while building a kernel I tried to decompress a
file; this failed on the first attempt, but succeeded on the second.
Yes, definitely seen that here, too.
I ran some memory testing tools, but they did not find any errors.
Ditto, I ran it many, many times and it hasn't found any issues, but it is
generally not good for stress-testing.
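Since a plain memory tester alone doesn't load the chip the way a big compile does, it may be worth running both at once. A hedged sh sketch; the memtester invocation and build command in the usage line are examples, not known-good settings for this machine:

```shell
# stress_pair MEM_CMD LOAD_CMD: run a memory soak and a compile load
# concurrently and report both exit codes; load-dependent memory
# faults are more likely to show up under combined pressure.
stress_pair() {
    mem_cmd="$1"
    load_cmd="$2"
    $mem_cmd > memtester.log 2>&1 &
    mem_pid=$!
    $load_cmd
    load_rc=$?
    wait "$mem_pid"
    mem_rc=$?
    echo "load exit: $load_rc, memtester exit: $mem_rc"
}
```

Something like `stress_pair "memtester 256M 1" "make -j2"` (size and job count are guesses for a 512 MB machine).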
My questions here being:
(a) do you run a custom-built kernel?
I tried with the old 2.6.29 (IIRC from the old Ubuntu5 tarball last
year), with 220.127.116.11+ and with my own patched 18.104.22.168+ (22.214.171.124+ plus
patches to 126.96.36.199). The same happens on all of them. I haven't tried
(b) do you use the binary nvidia driver?
Yes, but I have observed the instability without the driver being
loaded (i.e. not starting xorg or, in the case of 2.6.29, nvrm_daemon),
so I don't think the binary driver is the issue here. It would be very
hard for the binary Xorg driver to cause other programs to randomly segfault.
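To make "driver not loaded" runs verifiable, a quick check against /proc/modules before each test run is enough. A small sketch; the module name "nvidia" is an assumption about how the binary driver registers itself:

```shell
# driver_loaded NAME [MODULES_FILE]: succeed if NAME appears as a
# loaded kernel module. The optional file argument exists only so
# this can be exercised off-target; it defaults to /proc/modules.
driver_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

E.g. `driver_loaded nvidia && echo "abort: driver is loaded"` before kicking off a build.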
If I recall correctly, the builds have only succeeded so far on systems
without binary drivers. But I could be wrong.
I have definitely observed instability regardless of the binary driver.
The obvious question I have now is that since there clearly are several
people who have seen stability issues, why hasn't this been raised
before? If we have hardware that is demonstrably marginal across model
variants (10U, 10V), how come nobody has kicked off about it before me?
If it turns out that the AC100 is systematically suffering from duff,
factory-overclocked hardware (as is fairly typical of nvidia: their
chips generally cannot handle running at full load at default clocks for
reasonable periods of time, and they have no margin for error at all,
in terms of either default voltages or clock speeds), it seems the effort
going into it may well be wasted, at least until other similar hardware
becomes available. I'm eagerly awaiting Jeremiah's report on whether his
TrimSlice is exhibiting the same issues. I sincerely hope it isn't, and
that it's down to memory timings, since at least we can try to do
something about those.