
ac100 team mailing list archive

Stability Under Load

 

As some of you may have already heard on the IRC channel, my AC100 suddenly became very unstable under load. During big compile jobs, the compiler would fairly regularly segfault, report a suspected hardware error, or hit an error it did not attribute to hardware and invite me to file a bug report with the pre-processed C file. None of these were reproducible (it would error out in a different place on different runs). So I figured I had duff hardware and got another one. The new one is a lot better, but I still get spurious, unreproducible errors like this every few hours (the old one would error out up to a few times an hour when hammered with compile jobs for a few hours). Both of mine are the 10U models with Micron RAM.

Now, either I am incredibly unlucky or something else is going on. What I would like to know is:
1) Do you use your AC100 for big compile jobs (e.g. the 2-day gcc compile)?
2) If so, are you seeing random errors like the ones I'm describing?

On my old AC100, dropping the clock speed down to 700MHz using the power management features made no difference to stability. I haven't tested that on the new one.

My gut feeling at the moment is that the RAM timings could be too aggressive, so I'm going to try modifying the kernel code to relax them by a notch.

The only competing idea is that the Tegra2 comes pre-overclocked past the limit at which it is stable under 100% load for prolonged periods. This wouldn't surprise me either (Nvidia chips have proven unreliable in the past even at their default clock speeds, both the motherboard chipsets and the GPUs), but I would like to think that Toshiba would have done some due diligence testing of their product. For comparison, my SheevaPlug compiles 24/7 for weeks at a time and has never errored out.

Any additional data points you guys can provide would be useful.
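
For anyone without a big compile job to hand, something like the following might serve as a cheaper way to get a data point. It is only a rough sketch and not something I have validated against the failures described above: it repeats a deterministic hash workload over a fixed buffer and reports any pass whose result differs from the first one, which should never happen on sound hardware. The function names (checksum_pass, run_soak) and the default sizes are made up for illustration, and it only loads one core, so on the Tegra2 you would presumably want to run two copies at once and leave them going for hours.

```python
import hashlib
import os


def checksum_pass(buf: bytes, rounds: int) -> str:
    """Hash the same buffer repeatedly; the result depends only on the inputs."""
    digest = hashlib.sha256(buf).digest()
    for _ in range(rounds):
        # Chain the previous digest in so every round re-reads the whole buffer.
        digest = hashlib.sha256(buf + digest).digest()
    return digest.hex()


def run_soak(passes: int = 10, buf_mib: int = 16, rounds: int = 20) -> list:
    """Return the indices of passes whose checksum differs from the first pass."""
    buf = os.urandom(buf_mib * 1024 * 1024)  # fixed for the whole run
    reference = checksum_pass(buf, rounds)
    mismatches = []
    for i in range(1, passes):
        if checksum_pass(buf, rounds) != reference:
            mismatches.append(i)
    return mismatches


if __name__ == "__main__":
    bad = run_soak()
    print("mismatching passes:", bad if bad else "none")
```

A clean run proves nothing on its own, but a mismatch would point at the hardware rather than at gcc, since the workload is identical on every pass.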

Gordan

