On Fri, Aug 19, 2011 at 10:18:31AM +0100, Gordan Bobic wrote:

As some of you may have already heard on the IRC channel, my AC100 suddenly became very unstable under load. When doing big compile jobs, the compiler would fairly regularly segfault, detect hardware errors, or hit errors it didn't think were hardware-related and invite me to post a bug report with the pre-processed C file. None of these were reproducible (it would error out in a different place on different runs).

So I figured I had duff hardware and got another one. This is a lot better, but I still get spurious, unreproducible errors like this every few hours (the old one would error out up to a few times an hour if it was being hammered with compile jobs for a few hours). Both of mine are the 10U models with Micron RAM.

Now, either I am incredibly unlucky or something else is going on. What I would like to know is:

1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc compile)?
2) If 1), are you seeing random errors like what I'm describing?

Yes, I am experiencing crashing and endlessly recursing GCCs when trying to build kernels on my 10V, on Debian armel, at least when compiling with multiple cores. With one core, I got a complete build (although this then takes 4 hours), at least after having the machine off for a day and then booting and starting the build directly. I also built a kernel in an armhf environment using 3-5 parallel jobs without a problem (in 2 hours of build time).
So you are suggesting that the hf platform is actually more stable? Are your results repeatable, in the sense of demonstrating that it doesn't happen on hf?
Furthermore, while building a kernel I tried to decompress a file; this failed on the first attempt, but succeeded on the second attempt.
Yes, definitely seen that here, too.
I ran some memory testing tools, but they did not find any problem.
Ditto. I ran them many, many times and they haven't found any issues, but memory testers are generally not good for stress-testing.
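For anyone who wants something closer to a load test than a memory tester, here is a minimal sketch of a consistency check: hash the same buffer many times on several threads and count any digest that differs from a reference. It is not something that was run in this thread, and the names `digest` and `consistency_check` and all the default sizes are made up for illustration. It relies on `hashlib` releasing the GIL for large buffers so that the threads can keep more than one core busy.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def digest(buf: bytes) -> str:
    """Return the SHA-256 hex digest of buf."""
    return hashlib.sha256(buf).hexdigest()


def consistency_check(workers: int = 4, rounds: int = 50,
                      size: int = 8 * 1024 * 1024) -> int:
    """Hash one random buffer `rounds` times on `workers` threads.

    Returns how many digests differ from the reference digest.
    The defaults are arbitrary illustration values, not tuned for the AC100.
    """
    buf = os.urandom(size)
    reference = digest(buf)  # computed once, before the parallel load starts
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(digest, [buf] * rounds))
    return sum(1 for r in results if r != reference)


if __name__ == "__main__":
    bad = consistency_check()
    print(f"{bad} mismatching digests")
```

Any non-zero count means the same input produced different output on different runs, which is the same kind of unreproducible failure as the compiler errors. It would not say whether RAM, CPU or timings are at fault, and a clean run proves little on its own.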
My questions here being: (a) do you run a custom-built kernel?
I tried with the old 2.6.29 (IIRC from the old Ubuntu5 tarball last year), with 2.6.38.3+ and with my own patched 2.6.38.8+ (2.6.38.3+ plus the patches up to 2.6.38.8). The same happens on all of them. I haven't tried 2.6.32 yet.
(b) do you use the binary nvidia driver?
Yes, but I have observed the instability without the driver being loaded (i.e. not starting xorg or, in the case of 2.6.29, nvrm_daemon), so I don't think the binary driver is the issue here. It would be very hard for the binary Xorg driver to cause other programs to randomly crash.
If I recall correctly, the builds have so far only succeeded on systems without binary drivers. But I could be wrong.
I have definitely observed instability regardless of the binary drivers.
The obvious question I have now is this: since there clearly are several people who have seen stability issues, why hasn't this been raised before? If we have hardware that is demonstrably marginal across model variants (10U, 10V), how come nobody has kicked off about it before me?
If it turns out that the AC100 is systematically suffering from duff, pre-overclocked hardware (as is fairly typical of nvidia: their chips generally cannot handle running at full load at default clocks for reasonable periods of time, and they have no margin for error at all in terms of either default voltages or clock speeds), it seems the effort going into it may well be wasted, at least until other similar hardware becomes available. I'm eagerly awaiting Jeremiah's report on whether his TrimSlice is exhibiting the same issues. I sincerely hope it isn't and that it's down to memory timings, since at least we can try to do something about those.
Gordan