[Bug 438136] Re: palimpsest bad sectors false positive

 

Launchpad has imported 11 comments from the remote bug at
http://bugs.freedesktop.org/show_bug.cgi?id=25772.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2009-12-23T01:11:19+00:00 Jelot-freedesktop wrote:

A comment in the code says:

/* We use log2(n_sectors) as a threshold here. We had to pick
 * something, and this makes a bit of sense, or doesn't it? */

this means:

128 GB = 2^37 bytes -> log2(2^37 / 2^9) = log2(2^28) = 28 sectors
1 TB = 2^40 bytes -> log2(2^40 / 2^9) = log2(2^31) = 31 sectors
8 TB = 2^43 bytes -> log2(2^43 / 2^9) = log2(2^34) = 34 sectors

I think this is an unlucky heuristic.

The meaning of the raw value is vendor specific.
It could make sense if BAD_SECTOR_MANY were calculated like:

(worst value - threshold value) <= 5 ?

Obviously this is only an example.
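
For reference, the heuristic under discussion boils down to something
like the following C sketch; this is only an illustration of the quoted
behaviour, not libatasmart's actual code, and the function name is made
up:

#include <stdint.h>
#include <math.h>

/* Allow up to log2(n_sectors) bad sectors, per the code comment quoted
 * above. Assumes 512-byte sectors. */
static uint64_t bad_sector_threshold(uint64_t disk_size_bytes)
{
    uint64_t n_sectors = disk_size_bytes / 512;

    /* 128 GB -> 28, 1 TB -> 31, 8 TB -> 34 */
    return (uint64_t) log2((double) n_sectors);
}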

Reply at: https://bugs.launchpad.net/libatasmart/+bug/438136/comments/98

------------------------------------------------------------------------
On 2009-12-23T05:15:11+00:00 Lennart-poettering wrote:

The entire SMART attribute business is highly vendor dependent, since
there is no officially accepted spec for SMART attribute decoding. (It
never became an official standard; it was only ever a draft that was
later withdrawn.) Fortunately, on almost all drives the raw data of
quite a few fields can be decoded the same way. In libatasmart we try to
include the decoding of fields where it makes sense and is commonly
accepted.

OTOH the non-raw fields (i.e. "current" and "worst") encode the
information about the raw number of sectors (for sector-related
attributes) in a way that no longer lets us determine the actual number
of sectors.

The reason for this extra threshold we apply here is that we wanted
vendor-independent health checking, i.e. as long as we can trust the
number of raw bad sectors the drive reports, we can compare it with a
threshold that is not fiddled with by the vendor to make its drives
look better.

The reason I picked log2() here is simply that we do want to allow more
bad sectors on bigger drives than on small ones. But a linearly related
threshold seemed to increase too quickly, so the next choice was
logarithmic.
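
(For scale, an illustrative comparison - the linear rate is purely an
assumption for this example: a linear rule of one allowed bad sector per
GiB would permit 128 bad sectors on a 128 GiB drive and 1024 on a 1 TiB
drive, whereas log2(n_sectors) permits 28 and 31 respectively.)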

Do you have any empirical example where the current thresholds do not work
as they should?

Reply at: https://bugs.launchpad.net/libatasmart/+bug/438136/comments/99

------------------------------------------------------------------------
On 2009-12-28T08:38:04+00:00 Stephen-boddy wrote:

Please check the associated skdump save file. This is an old 20GB laptop
drive. The latest Ubuntu 9.10 ships with libatasmart 0.16. I think this
drive is incorrectly flagged as failing, because the lib relies on the
raw value being a single raw48 value. This then looks like very many
(262166) bad blocks.

Using "smartctl -a /dev/sda" I get the following extracts:

SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       262166
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       4

If I use the -v 5,raw8 option
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0 0 0 4 0 22 

If I use the -v 5,raw16 option
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0 4 22

The attribute is being read as raw48, which in this case looks to be
completely wrong. The values read with the different raw# options seem
to tie in with attribute 196.
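
To make the misparse concrete, here is a small C sketch of how the same
six raw bytes yield 262166 as raw48 but 0/4/22 as raw16. It is
illustrative only (not smartctl's or libatasmart's code), and the bytes
are shown most-significant-first for readability:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The six raw bytes behind the value 262166 (0x000000040016). */
    uint8_t raw[6] = { 0x00, 0x00, 0x00, 0x04, 0x00, 0x16 };

    /* raw48: all six bytes as one integer -> 262166 "bad sectors" */
    uint64_t raw48 = 0;
    for (int i = 0; i < 6; i++)
        raw48 = (raw48 << 8) | raw[i];

    /* raw16: three 16-bit words -> 0 4 22 */
    uint16_t words[3];
    for (int i = 0; i < 3; i++)
        words[i] = (uint16_t) ((raw[2 * i] << 8) | raw[2 * i + 1]);

    printf("raw48: %llu\n", (unsigned long long) raw48);   /* 262166 */
    printf("raw16: %u %u %u\n", (unsigned) words[0],
           (unsigned) words[1], (unsigned) words[2]);      /* 0 4 22 */
    return 0;
}

Read as raw16, the plausible reallocated-sector count is 22, and the 4
lines up with Reallocated_Event_Count (attribute 196).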

It could be argued that if you cannot rely on the format of the raw
value, you should not base warnings on it, and should only use the
normalized, worst and threshold values. I'm technical, and I damn near
junked a relative's old but still serviceable laptop because of this.

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/104

------------------------------------------------------------------------
On 2009-12-28T08:39:08+00:00 Stephen-boddy wrote:

Created an attachment (id=32330)
skdump of hard drive

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/105

------------------------------------------------------------------------
On 2009-12-29T04:17:41+00:00 Jelot-freedesktop wrote:

(In reply to comment #1)
> The reason I picked log2() here is simply that we do want to allow more bad
> sectors on bigger drives than on small ones. But a linearly related threshold
> seemed to increase too quickly, so the next choice was logarithmic.
> 
> Do you have any empiric example where the current thresholds do not work as
> they should?
> 

For convenience I use kibibyte, mebibyte, gibibyte ...

128 GiB = 2^37 -> log2(2^37/512) = log2(2^37/2^9) = 28 sectors

For an HDD of 128 GiB (2^37 bytes) the calculated threshold is 28
sectors (14336 bytes = 14 KiB); isn't that too low?

For an HDD of 1 TiB (2^40 bytes) the calculated threshold is 31 sectors
(15872 bytes = 15.5 KiB) ...

For a hypothetical HDD of 1 PiB (2^50 bytes, 1024 TiB) the calculated
threshold is only 41 sectors (20992 bytes = 20.5 KiB) ...

If we do want to allow more bad sectors on bigger drives than on small
ones, IMHO this isn't a good heuristic.

The difference between a 128 GiB HDD and an 8 TiB HDD is only 6 sectors
(3 KiB).

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/107

------------------------------------------------------------------------
On 2009-12-30T06:12:33+00:00 Jelot-freedesktop wrote:

I forgot to say that this bug report and the enhancement requested in
Bug #25773 are due to Launchpad Bug 438136
<https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/438136?comments=all>

On Launchpad there are also some screenshots of palimpsest that show
disks flagged as failing with relatively few bad sectors, or with raw
values in what is probably a different format (there are counts like
65537, 65539, 65551, 65643 and similar).

Some examples:

117 bad sectors (58.5 KiB) on 1000GB HDD
<http://launchpadlibrarian.net/32604239/palimpsest-screenshot.png>

66 bad sectors (33 KiB) on 200GB HDD
<http://launchpadlibrarian.net/34794631/Screenshot-SMART%20Data.png>

466 bad sectors (233 KiB) on 1500GB HDD
<http://launchpadlibrarian.net/34991157/Screenshot.png>

65 bad sectors (32.5 KiB) on 120GB HDD (all current pending sectors)
<http://launchpadlibrarian.net/35201129/Pantallazo-Datos%20SMART.png>

54 bad sectors (27 KiB) on 169GB HDD
<http://launchpadlibrarian.net/36115988/Screenshot.png>

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/108

------------------------------------------------------------------------
On 2010-03-19T04:00:15+00:00 Martin Pitt wrote:

The bigger problem here (as you already mentioned) is that the raw
value is misparsed way too often. Random examples from bug reports:

  http://launchpadlibrarian.net/34574037/smartctl.txt
5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       327697

  http://launchpadlibrarian.net/35971054/smartctl_tests.log
5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       65542

  http://launchpadlibrarian.net/36599746/smartctl_tests-deer.log
5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       65552

  https://bugzilla.redhat.com/attachment.cgi?id=382378
5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       655424

  https://bugzilla.redhat.com/show_bug.cgi?id=506254
reallocated-sector-count    100/100/  5   FAIL    1900724 sectors Prefail Online

It seems that "no officially accepted spec about SMART attribute
decoding" also hits here in the sense of that way too many drives get
the raw counts wrong. In all the 30 or so logs that I looked at in the
various Launchpad/RedHat/fd.o bug reports related to this I didn't see
an implausible value of the normalized values, though.

I appreciate the effort of doing vendor-independent bad block checking,
but a lot of people get tons of false alarms from it, and thus won't
believe it any more when a disk really is failing some day.

My feeling is that a more cautious approach would be to use the
normalized value vs. threshold for the time being, and use the raw
values if/when that can be made more reliable (then we should use
something in between logarithmic and linear, though, since due to sheer
probabilities, large disks will have more bad sectors and also more
reserve sectors than small ones).
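
In code, that cautious check amounts to something like this minimal
sketch (illustrative names, not libatasmart API):

#include <stdbool.h>
#include <stdint.h>

/* An attribute is considered failing when its normalized current value
 * has dropped to or below the vendor-set threshold (a threshold of 0
 * conventionally means "not applicable"). */
static bool attribute_failing(uint8_t current, uint8_t threshold)
{
    return threshold != 0 && current <= threshold;
}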

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/144

------------------------------------------------------------------------
On 2010-03-19T04:27:33+00:00 Martin Pitt wrote:

Created an attachment (id=34234)
smart blob with slightly broken sectors

BTW, I use this smart blob for playing around and testing; it is a
particularly interesting one: it has a few bad sectors (correctly
parsed), but not enough yet to drop below the vendor-specified
threshold.

  5 reallocated-sector-count     77     1    63   1783 sectors 0xf70600000000 prefail online  yes  no  
197 current-pending-sector       83     6     0   1727 sectors 0xbf0600000000 old-age offline n/a  n/a 

So this can be loaded into skdump or udisks for testing the desktop
integration all the way through:

$ sudo udisks --ata-smart-refresh /dev/sda --ata-smart-simulate /tmp/smart.blob

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/145

------------------------------------------------------------------------
On 2010-03-19T07:02:09+00:00 Martin Pitt wrote:

Created an attachment (id=34242)
Drop our own "many bad sectors" heuristic

This patch just uses the standard "compare normalized value against
threshold" check. I know that it's not necessarily how you really want
it to work, but it's a pragmatic solution to avoid all those false
positives, which don't help people either.

So of course feel free to entirely ignore it, but at least I want to
post it here for full disclosure. (I'll apply it to Debian/Ubuntu, we
have to get a release out).

This patch is against the one in bug 26834.

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/146

------------------------------------------------------------------------
On 2010-03-19T07:05:13+00:00 Martin Pitt wrote:

Oh, forgot: I compared

  for i in blob-examples/*; do echo "-- $i"; ./skdump --load=$i; done > /tmp/atasmart-test.out

before and after, and get two differences like

-Overall Status: BAD_SECTOR_MANY
+Overall Status: BAD_SECTOR

The first one is against blob-examples/Maxtor_96147H8--BAC51KJ0:
 5 reallocated-sector-count    226   226    63   69 sectors  0x450000000000 prefail online  yes  yes 

and the second one against blob-examples/WDC_WD5000AAKS--00TMA0-12.01C01

  5 reallocated-sector-count    192   192   140   63 sectors  0x3f0000000000 prefail online  yes  yes

So under the premise of changing the evaluation to use the normalized
numbers, those are correct and expected changes.

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/147

------------------------------------------------------------------------
On 2010-07-04T02:09:56+00:00 cowbutt wrote:

(In reply to comment #1)

> The reason I picked log2() here is simply that we do want to allow more bad
> sectors on bigger drives than on small ones. But a linearly related threshold
> seemed to increase too quickly, so the next choice was logarithmic.
> 
> Do you have any empiric example where the current thresholds do not work as
> they should?

According to
http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=SeaTools_Error_Codes_-_Seagate_Technology&vgnextoid=d173781e73d5d010VgnVCM100000dd04090aRCRD
(which I first read about 18 months ago, when 1.5TB drives were brand
new), "Current disk drives contain *thousands* [my emphasis] of spare
sectors which are automatically reallocated if the drive senses
difficulty reading or writing". Therefore, it is my belief that your
heuristic is off by somewhere between one and two orders of magnitude:
it only allows for 30 bad sectors on a 1TB drive, while Seagate's
article implies such a drive has at least 2000 spare sectors - and
maybe more - of which 30 is only 1.5%.

As you say, though, this is highly manufacturer- and model-dependent;
Seagate's drives might be designed with very many more spare sectors
than other manufacturers' drives. The only sure-fire way to interpret
the SMART attributes is to compare the cooked value with the vendor-set
threshold for that attribute.

If you are insistent upon doing something with the raw
reallocated-sector-count attribute, I believe it would be far more
useful to alert when it changes at all, or when it changes by a large
number of sectors in a short period of time, as in the sketch below.
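
Something along these lines, say (an illustrative sketch; the names and
parameters are assumptions, not libatasmart code):

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Remember the raw reallocated-sector count between polls; warn on any
 * growth, and flag fast growth separately. */
struct realloc_watch {
    uint64_t last_count;
    time_t   last_time;
};

/* Returns true when the count grew since the last poll; *fast is set
 * when it grew by more than 'burst' sectors within 'window' seconds. */
static bool realloc_grew(struct realloc_watch *w, uint64_t count,
                         time_t now, uint64_t burst, double window,
                         bool *fast)
{
    uint64_t delta = count > w->last_count ? count - w->last_count : 0;

    *fast = delta > burst && difftime(now, w->last_time) <= window;
    w->last_count = count;
    w->last_time  = now;
    return delta > 0;
}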

Reply at:
https://bugs.launchpad.net/libatasmart/+bug/438136/comments/167


** Changed in: libatasmart
   Importance: Unknown => Medium

-- 
palimpsest bad sectors false positive
https://bugs.launchpad.net/bugs/438136
You received this bug notification because you are a member of Registry
Administrators, which is the registrant for Fedora.