
ac100 team mailing list archive

Re: Stability Under Load

 

On Fri, 19 Aug 2011 18:01:02 +0200, Julian Andres Klode <jak@xxxxxxxxxx> wrote:

>I ran some memory testing tools, but they did not find any
>problem.

Ditto, I ran it many, many times and it hasn't found any issues, but it
is generally not good for stress-testing.

But I believe it shows us that memory itself is correct, and the
problem must be somewhere else.

It doesn't eliminate the possibility that the memory is overclocked too far. This is similar to overclock testing on x86: I have run memtest86 for days without finding any problems, only to have OCCT detect an overclocking-induced error in under 30 seconds. In my experience, memory testers aren't a harsh enough test to show up marginal components, so it could easily still be a memory timing issue.

It would be very hard for the binary Xorg driver to cause other
programs to randomly crash.

Part of the system memory is used by the display driver, so if
the kernel has a bug that it uses one of those portions of the
RAM despite it being used by the graphics system, then this could
explain it.

Hmm, that is plausible. But would the problem also show up when no driver other than the console FB is loaded?

The obvious question I have now is that since there clearly are
several people who have seen stability issues, why hasn't this been
raised before?

I raised the issue multiple times on IRC, but obviously only when
you were not there.

Ah, good to know. Perhaps this is worth a page on the Wiki, linked from the front page? This is something that is likely to be affecting a lot of users.

If it turns out that AC100 is systematically suffering from duff,
pre-over-overclocked hardware (as is fairly typical of nvidia -
their chips generally cannot handle running at full load at default
clocks for reasonable periods of time, and they have no margin for
error at all, both in terms of default voltages and clock-speeds),
it seems the effort going into it may well be wasted, at least until
other similar hardware becomes available. I'm eagerly awaiting
Jeremiah's report on whether his TrimSlice is exhibiting the same
issues. I sincerely hope it isn't and that it's down to memory
timings, since at least we can try to do something about those.

We could still underclock devices if needed.

I underclocked my old AC100 down to <= 700MHz using the power management governor, and it was still erroring out just the same. So this doesn't seem to be a clock-speed issue, unless something else is going out of whack at the same time (e.g. undervolting at all clock speeds).
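
For anyone wanting to repeat the underclocking test, here is a rough sketch of one way to cap the clock through the standard Linux cpufreq sysfs files. It assumes the kernel exposes cpufreq for the CPU and that it is run as root; the helper name cap_max_freq is made up for illustration, and this is not necessarily the exact method used above.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location (assumes cpufreq support in the kernel).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def cap_max_freq(khz, root=CPUFREQ_ROOT):
    """Cap scaling_max_freq on every CPU that exposes cpufreq.

    Hypothetical helper: returns the cap (in kHz) read back for each CPU.
    Writing to the real sysfs files needs root.
    """
    applied = []
    for policy in sorted(root.glob("cpu[0-9]*/cpufreq")):
        lowest = int((policy / "cpuinfo_min_freq").read_text())
        # Never ask for less than the minimum the hardware reports.
        target = max(khz, lowest)
        (policy / "scaling_max_freq").write_text(f"{target}\n")
        applied.append(int((policy / "scaling_max_freq").read_text()))
    return applied
```

Calling cap_max_freq(700_000) as root would request a 700MHz ceiling on each core. Re-running the failing workload afterwards shows whether the errors follow the clock speed.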

Gordan

