
ac100 team mailing list archive

Re: Stability Under Load

 

On Fri, 19 Aug 2011 15:55:57 +0200, Marc Dietrich <marvin24@xxxxxx> wrote:
Hi Gordan,

On Friday, 19 August 2011, 11:18:31, Gordan Bobic wrote:
 As some of you may have already heard on the IRC channel, I had my
AC100 suddenly become very unstable under load. When doing big compile
[...]
My gut feeling at the moment is that the RAM timings could be too tight, so I'm going to try modifying the kernel code to relax them by a notch.

We are not touching the RAM timings so far on kernel 2.6.38. It is
possible that the original kernels do so.

Well, my plan was to relax (increase) the timings in arch/arm/mach-tegra/board-paz00-memory.c. Can you confirm whether the values there are in units of clock cycles, or are they in ns? Also, which line corresponds to CAS? I can see RAS, RC, RCD, RRD and RFC, but I can't see a value for CAS, which is, at least in theory, the most important one.
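
If the values do turn out to be clock cycles, relaxing a timing by a notch is simple arithmetic on a datasheet figure given in ns. A minimal sketch in Python, not taken from the kernel source: the function names, the 15 ns figure and the clock rates are illustrative assumptions, not values from board-paz00-memory.c.

```python
import math


def ns_to_cycles(t_ns, clock_mhz):
    """Smallest whole number of clock cycles that covers t_ns.

    One cycle lasts 1000 / clock_mhz nanoseconds, so the cycle count
    is t_ns * clock_mhz / 1000, rounded up.
    """
    return math.ceil(t_ns * clock_mhz / 1000.0)


def relax(cycles, notches=1):
    """Loosen a timing by adding whole cycles (hypothetical helper)."""
    return cycles + notches


# Hypothetical example: a 15 ns timing at two assumed memory clocks.
print(ns_to_cycles(15, 200))          # 3 cycles at 200 MHz
print(ns_to_cycles(15, 333))          # 5 cycles at 333 MHz
print(relax(ns_to_cycles(15, 333)))   # 6 cycles after relaxing by one
```

The same table entry means different things depending on the unit, which is why the question above matters: a value of 5 is a sensible cycle count at 333 MHz but would be a very tight figure if it were read as ns.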

The only competing idea is that Tegra2 comes clocked beyond what is
stable at 100% load for prolonged periods. This wouldn't surprise me
either (Nvidia chips have proven unreliable in the past even at their
default clock speeds, both the motherboard chipsets and GPUs), but I
would like to think that Toshiba would have done some due diligence
testing of their product. For comparison, my SheevaPlug is compiling
24/7 for weeks at a time and has never errored out.

It could also be related to the power supply. What we do is modify the
voltage supplies for several power sources. I had the feeling that
Toshiba undervolted some CPU supplies in order to save energy (compared
to other boards), so I increased SM1 from 1 V to 1.2 V, which may have
been wrong.

How did you do this?

It would be nice if you could test a 2.6.32-based kernel and see if it
also happens there. You could also try your new model.

I haven't tried 2.6.32 because I couldn't find one at the time, but I tried the old 2.6.29 and 2.6.38, and the instability on my old AC100 was the same on both. I haven't tried it on the new one yet. Do you think 2.6.32 could behave differently from both of those? If so, why? Where can I get the Tegra-patched 2.6.32 kernel?

Finally, it is also possible that this is a design bug, which
may be hard for us to solve.

Plausible. At the moment, though, I'm most interested in finding out whether other people are seeing similar issues under prolonged load. It would also be useful to establish whether this might be specific to the RAM chips used (e.g. Micron chips erroring, but Hynix being OK).
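
For anyone willing to check, something along these lines could be left running alongside a big compile. It is only a rough user-space sketch in Python under my own assumptions (the function name, buffer size and pass count are made up for illustration), it has no control over which physical RAM it lands on, and it is no substitute for a dedicated memory tester. It fills a buffer with random data, hashes it once, then re-hashes it repeatedly and counts any pass where the hash changes.

```python
import hashlib
import os


def checksum_mismatches(size_bytes, passes):
    """Count passes where a buffer's hash differs from the first hash.

    The buffer is never modified after it is filled, so any mismatch
    means the data (or the computation) was corrupted somewhere.
    """
    buf = bytearray(os.urandom(size_bytes))
    expected = hashlib.sha256(buf).digest()
    mismatches = 0
    for _ in range(passes):
        if hashlib.sha256(buf).digest() != expected:
            mismatches += 1
    return mismatches


# Hypothetical run: 64 MiB buffer, 100 passes. Sizes are assumptions.
bad = checksum_mismatches(64 * 1024 * 1024, 100)
print("mismatching passes:", bad)
```

A non-zero count on an otherwise idle-stable machine would at least suggest the errors are not specific to the compiler.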

Gordan

