ac100 team mailing list archive
Message #00174
Stability Under Load
As some of you may have already heard on the IRC channel, I had my
AC100 suddenly become very unstable under load. When doing big compile
jobs, the compiler would fairly regularly segfault, report suspected
hardware errors, or hit errors it didn't think were hardware-related and
invite me to post a bug report with the pre-processed C file. None of these were
reproducible (it would error out in a different place on different
runs). So I figured I had duff hardware and got another one. The new one is a
lot better, but I still get spurious, unreproducible errors like this
every few hours (the old one would error out up to a few times an hour when
hammered with compile jobs for a few hours). Both of mine
are the 10U models with Micron RAM.
Now, either I am incredibly unlucky or something else is going on. What
I would like to know is:
1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc
compile)?
2) If so, are you seeing random errors like the ones I'm describing?
On my old AC100, dropping the clock speed down to 700MHz using power
management features didn't make a difference to stability. I haven't
tested that on the new one.
My gut feeling at the moment is that the RAM timings could be too aggressive,
so I'm going to try modifying the kernel code to relax them by a
notch.
The only competing idea is that Tegra2 comes pre-overclocked past the
stable limits for 100% load for prolonged periods. This wouldn't
surprise me either (Nvidia chips have proven unreliable in the past even
at their default clock speeds, both the motherboard chipsets and GPUs),
but I would like to think that Toshiba would have done some due
diligence testing of their product. For comparison, my SheevaPlug is
compiling 24/7 for weeks at a time and has never errored out.
Any additional data points you guys can provide would be useful.
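
If you don't have a multi-day compile handy, something like the following
rough Python sketch might serve as a quick check. It is not what I have been
running, just an illustration: it repeats a deterministic fill-and-hash pass
and counts passes whose digest differs from the first one. On sound hardware
the count should stay at zero. The function names, buffer size and seed are
all made up for the example, and a clean run proves nothing about a machine
that only fails under heavier or different load.

```python
import hashlib


def checksum_pass(size_bytes, seed):
    """Fill a buffer from a simple deterministic generator and hash it."""
    buf = bytearray(size_bytes)
    state = seed
    for i in range(size_bytes):
        # Linear congruential step; the constants are arbitrary but fixed,
        # so every pass with the same seed must produce the same bytes.
        state = (state * 1103515245 + 12345) & 0x7FFFFFFF
        buf[i] = (state >> 16) & 0xFF
    return hashlib.sha256(buf).hexdigest()


def count_mismatches(passes, size_bytes=1 << 20, seed=1):
    """Run `passes` identical passes; count those differing from the first."""
    reference = checksum_pass(size_bytes, seed)
    mismatches = 0
    for _ in range(passes - 1):
        if checksum_pass(size_bytes, seed) != reference:
            mismatches += 1
    return mismatches
```

Calling count_mismatches(1000) and leaving it running for a few hours (one
instance per core) would give a number to report alongside your model and
RAM vendor.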
Gordan