ac100 team mailing list archive
Re: Stability Under Load
On Fri, 19 Aug 2011 16:35:47 +0200, Julian Andres Klode wrote:
On Fri, Aug 19, 2011 at 10:18:31AM +0100, Gordan Bobic wrote:
As some of you may have already heard on the IRC channel, I had my
AC100 suddenly become very unstable under load. When doing big
compile jobs, the compiler would relatively regularly segfault,
detect hardware errors, or hit errors it didn't think were
hardware-related and invite me to file a bug report with the
pre-processed C file. None of
these were reproducible (it would error out in a different place on
different runs). So I figured I had duff hardware and got another
one. This is a lot better, but I still get spurious, unreproducible
errors like this every few hours (the old one would error out up to a
few times per hour when it was being hammered with compile jobs for a
few hours). Both of mine are 10U models with Micron RAM.
Now, either I am incredibly unlucky or something else is going on.
What I would like to know is:
1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc build)?
2) If 1), are you seeing random errors like what I'm describing?
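For anyone wanting to answer 2) on their own machine, a loop along these lines makes the failure rate measurable: rerun the same build repeatedly and count how many runs die. This is a minimal POSIX-sh sketch; `retry_build` and its arguments are placeholders for whatever compile job you actually run.

```shell
# retry_build CMD N: run CMD up to N times, logging each run, and
# report how many runs failed. Intermittent compiler segfaults show
# up as a non-zero failure count.
retry_build() {
    cmd="$1"
    runs="$2"
    fails=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Log each run separately so the failing runs can be compared.
        if ! $cmd > "run-$i.log" 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails of $runs runs failed"
}
```

Something like `retry_build "make -j2 zImage" 10` would do (the make target is an assumption; substitute your own build).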
Yes, I am experiencing crashing and endlessly recursing GCCs when
trying to build kernels on my 10V, on Debian armel, at least when
compiling with multiple cores. With one core, I got a complete
build (although that takes 4 hours), at least after having
the machine off for a day and then booting and starting to build.
I also built a kernel in an armhf environment using 3-5 parallel
jobs without a problem (2-hour build time).
So you are suggesting that the hf platform is actually more stable? Are
your results repeatable, in terms of demonstrating that on hf it doesn't fail?
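One way to pin that down would be to rerun the same build at several -j levels and see whether only the parallel runs fail. A rough sh sketch; the build wrapper passed in is hypothetical (it should run something like `make -j"$1"` and exit non-zero on failure):

```shell
# try_jobs BUILD J1 J2 ...: invoke BUILD once per job count and
# report pass/fail for each level, to see whether failures track
# parallelism rather than total work done.
try_jobs() {
    build="$1"
    shift
    for jobs in "$@"; do
        if $build "$jobs" >/dev/null 2>&1; then
            echo "-j$jobs: ok"
        else
            echo "-j$jobs: failed"
        fi
    done
}
```

E.g. `try_jobs ./kernel_build.sh 1 2 4` (the wrapper script name is made up).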
Furthermore, while building a kernel I tried to decompress a
file; this failed on the first attempt, but succeeded on the second.
Yes, definitely seen that here, too.
I ran some memory testing tools, but they did not find any errors.
Ditto, I ran it many, many times and it hasn't found any issues, but it is
generally not good for stress-testing.
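Since a plain memory tester alone doesn't load the chip the way a big compile does, it may be worth running both at once. A hedged sh sketch; the memtester invocation and build command in the usage line are examples, not known-good settings for this machine:

```shell
# stress_pair MEM_CMD LOAD_CMD: run a memory soak and a compile load
# concurrently and report both exit codes; load-dependent memory
# faults are more likely to show up under combined pressure.
stress_pair() {
    mem_cmd="$1"
    load_cmd="$2"
    $mem_cmd > memtester.log 2>&1 &
    mem_pid=$!
    $load_cmd
    load_rc=$?
    wait "$mem_pid"
    mem_rc=$?
    echo "load exit: $load_rc, memtester exit: $mem_rc"
}
```

Something like `stress_pair "memtester 256M 1" "make -j2"` (size and job count are guesses for a 512 MB machine).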
My questions here being:
(a) do you run a custom-built kernel?
I tried with the old 2.6.29 (IIRC from the old Ubuntu5 tarball last
year), with 220.127.116.11+ and with my own patched 18.104.22.168+ (22.214.171.124+ plus
patches to 126.96.36.199). The same happens on all of them. I haven't tried
(b) do you use the binary nvidia driver?
Yes, but I have observed the instability without the driver being
loaded (i.e. not starting xorg or, in the case of 2.6.29, nvrm_daemon),
so I don't think the binary driver is the issue here. It would be very
hard for the binary Xorg driver to cause other programs to randomly segfault.
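To make "driver not loaded" runs verifiable, a quick check against /proc/modules before each test run is enough. A small sketch; the module name "nvidia" is an assumption about how the binary driver registers itself:

```shell
# driver_loaded NAME [MODULES_FILE]: succeed if NAME appears as a
# loaded kernel module. The optional file argument exists only so
# this can be exercised off-target; it defaults to /proc/modules.
driver_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

E.g. `driver_loaded nvidia && echo "abort: driver is loaded"` before kicking off a build.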
If I recall correctly, the builds have only succeeded so far on systems
without binary drivers. But I could be wrong.
I have definitely observed instability regardless of the binary driver.
The obvious question I have now is that since there clearly are several
people who have seen stability issues, why hasn't this been raised
before? If we have hardware that is demonstrably marginal across model
variants (10U, 10V), how come nobody has kicked off about it before me?
If it turns out that the AC100 is systematically suffering from duff,
factory-overclocked hardware (as is fairly typical of nvidia: their
chips generally cannot handle running at full load at default clocks for
reasonable periods of time, and they have no margin for error at all,
in terms of either default voltages or clock speeds), it seems the effort
going into it may well be wasted, at least until other similar hardware
becomes available. I'm eagerly awaiting Jeremiah's report on whether his
TrimSlice is exhibiting the same issues. I sincerely hope it isn't, and
that it's down to memory timings, since at least we can try to do
something about those.