← Back to team overview

registry team mailing list archive

[Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

 

Launchpad has imported 20 comments from the remote bug at
http://bugs.freedesktop.org/show_bug.cgi?id=25673.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2009-12-16T06:59:31+00:00 Scott James Remnant wrote:

We have many reports of the libatasmart code causing stalls, HSM
Violations and even death of SSDs.  Particularly SuperTalent ones, but
also those found in my netbooks.

# sleep 120; logger devkit-disks-probe-ata-smart; /lib/udev/devkit-
disks-probe-ata-smart /dev/sda; sleep 120; logger done

And I get this in syslog (repeatably):

Dec 14 11:12:01 unus logger: devkit-disks-probe-ata-smart
Dec 14 11:12:35 unus kernel: [ 7734.000130] ata2: lost interrupt (Status 0x58)
Dec 14 11:12:35 unus kernel: [ 7734.000217] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 14 11:12:35 unus kernel: [ 7734.000232] ata2.00: BMDMA stat 0x4
Dec 14 11:12:35 unus kernel: [ 7734.000264] ata2.00: cmd ca/00:08:08:8b:54/00:00:00:00:00/e3 tag 0 dma 4096 out
Dec 14 11:12:35 unus kernel: [ 7734.000270] res 58/00:08:08:8b:54/00:00:00:00:00/e3 Emask 0x2 (HSM violation)
Dec 14 11:12:35 unus kernel: [ 7734.000284] ata2.00: status: { DRDY DRQ }
Dec 14 11:12:35 unus kernel: [ 7734.000343] ata2: soft resetting link
Dec 14 11:12:35 unus kernel: [ 7734.208576] ata2.00: configured for UDMA/66
Dec 14 11:12:35 unus kernel: [ 7734.208618] ata2: EH complete
Dec 14 11:14:01 unus logger: done

The problem has also been confirmed in Fedora 12.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/122

------------------------------------------------------------------------
On 2009-12-16T06:59:57+00:00 Scott James Remnant wrote:

Kernel bugzilla bug for the same issue (URL above is the Launchpad bug)

http://bugzilla.kernel.org/show_bug.cgi?id=14583

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/123

------------------------------------------------------------------------
On 2009-12-16T07:49:45+00:00 Lennart-poettering wrote:

Hmm, lacking access to the hw in question I am not sure what I can do
about this.

What surprises me a bit is that this only appeared so very recently. Is
this triggered by some interplay with some specific kernel version?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/124

------------------------------------------------------------------------
On 2009-12-16T08:40:51+00:00 Scott James Remnant wrote:

Most distros only switched to using your code recently; previously we've
all been using smartmontools and the like which don't cause this
problem.

The LP bug has the differencing straces between the two if that's
helpful?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/125

------------------------------------------------------------------------
On 2009-12-17T00:26:11+00:00 Lennart-poettering wrote:

(In reply to comment #3)
> Most distros only switched to using your code recently; previously we've all
> been using smartmontools and the like which don't cause this problem.

Is it actually verified that this doesn't happen with smartmontools? I
mean, smartmontools in contrast to libatasmart does not issue commands
that early after initialization/hotplug? So, is it verified that the
problem is with the way libatasmart issues its commands and not simply
due to the context those commands are executed in?

> The LP bug has the differencing straces between the two if that's
helpful?

I only see a lot of noise in that bug report, could you point me tto the
two straces?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/129

------------------------------------------------------------------------
On 2009-12-17T00:28:12+00:00 Lennart-poettering wrote:

(In reply to comment #3)
> Most distros only switched to using your code recently; 

The simple fact is that rawhide (and the ubuntu betas) had this code for
months already, and we got quite a few bug reports, but never something
about this issue. This issue only appeared a couple of weeks back, and
hence I am wondering if something else changed in that time, because
libatasmart didn't.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/130

------------------------------------------------------------------------
On 2009-12-17T22:15:13+00:00 Andrewnz-simpson wrote:

(In reply to comment #5)
> The simple fact is that rawhide (and the ubuntu betas) had this code for months
> already, and we got quite a few bug reports, but never something about this
> issue. This issue only appeared a couple of weeks back, and hence I am
> wondering if something else changed in that time, because libatasmart didn't.
> 

The Ubuntu bug report goes back to early October, however the link with
libatasmart was only made very recently.

There is no interplay with any specific kernel version:  The bug has
been confirmed on 2.6.28, 2.6.29, 2.6.30, 2.6.31 and 2.6.32.  Both
Ubuntu patched versions and mainline kernels are affected.

The bug has been confirmed as libatasmart only; testing has shown that
smartmontools does not give the same problem.  Early initialisation is a
possible issue, though the problem can be readily reproduced at any
time.

Comment #129 of the Ubuntu Bug Report is well worth reading, because it
seems to be isolating the bug.

I will attach the straces from the Ubuntu Bug Report to this report

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/135

------------------------------------------------------------------------
On 2009-12-17T22:27:14+00:00 Andrewnz-simpson wrote:

Created an attachment (id=32167)
Trace from libatasmart

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/136

------------------------------------------------------------------------
On 2009-12-17T22:27:46+00:00 Andrewnz-simpson wrote:

Created an attachment (id=32168)
Trace from smartctl

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/137

------------------------------------------------------------------------
On 2009-12-18T00:17:37+00:00 Lennart-poettering wrote:

(In reply to comment #6)

> 
> Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> to be isolating the bug.

I don't think so. That proposed patch is bogus, identify_valid is FALSE
unless set to TRUE anyway.

Also, supposedly SMART does work with smartmontools, just not with
libatasmart, right? That comment suggests that SMART would not work at
all with those SSDs.

Andrew, do you have one of the SSDs affected? Could you step through the
code and figure out exactly which command triggers the problem?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/139

------------------------------------------------------------------------
On 2009-12-18T00:45:56+00:00 Andrewnz-simpson wrote:

(In reply to comment #9)

> 
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
> 

You would have to give me very detailed instructions as to how to do it.
Programming in C is not one of my skills and I don't have a programming
background.

More realistically, there are at least a couple of people on the Ubuntu
Bug list that would know how to do this.  Is it worth putting out a
query?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/140

------------------------------------------------------------------------
On 2009-12-18T01:00:32+00:00 Alan Pope wrote:

In terms of how long ago this bug has existed, I originally filed a bug
on this issue back in June '09. So it's existed long before 9.10
(October) released.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/387272

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/141

------------------------------------------------------------------------
On 2009-12-18T10:00:53+00:00 Jelot-freedesktop wrote:

(In reply to comment #9)
> (In reply to comment #6)
> 
> > 
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
> 
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.

I'm the author of comment #129 in launchpad
<https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>

I'm quite a beginner with c, but I know that if a variable is not
initialized its value is garbage.

I don't find how d->identify_valid is zeroed or setted FALSE. Obviously,
I could be missed it.

> 
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.

I don't know internals of smartmoontools... I have read only some pages
of this pdf (2.7MB) http://www.t10.org/t13/project/d1321r3-ATA-
ATAPI-5.pdf

on pag 52 is reported:

[quote]
Devices that implement the PACKET Command feature set shall not implement the SMART feature set as described in this subclause. 
Devices that implement the PACKET Command feature set and SMART shall implement SMART as defined by the command packet set implemented by the device.
[/quote]

and on page 196 and subsequent is reported:

[quote]
SMART feature set.
  −  Mandatory when the SMART feature set is implemented.
  −  Use prohibited when the PACKET Command feature set is implemented.
[/quote]

I don't know how is implemented SMART for command packet set and I don't
know if this sdd implements command packet set, but this *could be* the
problem (IMHO)

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/142

------------------------------------------------------------------------
On 2009-12-19T01:45:16+00:00 Lennart-poettering wrote:

(In reply to comment #12)
> (In reply to comment #9)
> > (In reply to comment #6)
> > 
> > > 
> > > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > > to be isolating the bug.
> > 
> > I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> > set to TRUE anyway.
> 
> I'm the author of comment #129 in launchpad
> <https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>
> 
> I'm quite a beginner with c, but I know that if a variable is not initialized
> its value is garbage.
> 
> I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> could be missed it.

The initial calloc() call for allocating the SkDisk structure does the
zero initialization.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/153

------------------------------------------------------------------------
On 2009-12-19T04:33:43+00:00 Jelot-freedesktop wrote:

(In reply to comment #13)
> (In reply to comment #12)
> > I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> > could be missed it.
> 
> The initial calloc() call for allocating the SkDisk structure does the zero
> initialization.
> 

Uh... thanks for clarification and sorry for wasting your time.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/154

------------------------------------------------------------------------
On 2009-12-22T21:09:43+00:00 4280829-noduck wrote:

(In reply to comment #9)
> (In reply to comment #6)
> 
> > 
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
> 
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.
> 
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.
> 
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
> 

I ran gdb on devkit-disks-probe-ata-smart/libatasmart. I found that it
is the ioctl call from disk_smart_read_thresholds call that triggers the
HSM violation in the kernel log (disk_smart_read_thresdholds calls
disk_command with the argument SK_SMART_COMMAND_READ_THRESHOLDS), which
calls disk_passthrough_16_command, which calls sg_io, which does an
ioctl.

There is one earlier call to disk_command (with
SK_ATA_COMMAND_IDENTIFY_DEVICE), but that does not the trigger the HSM
violation in the kernel log. So I believe it is the way that the SSD
reacts to the READ_THRESHOLDS command that throws off the kernel.

Raf.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/155

------------------------------------------------------------------------
On 2010-01-07T07:23:51+00:00 Jelot-freedesktop wrote:

In Karmic there is a new stable kernel 2.6.31-17.54
<https://bugs.launchpad.net/ubuntu/+source/linux/+bug/480144> that refer
to upstream kernel 2.6.31.6; in the changelog
<http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.31.6> is
mentioned this commit 9982364654c186acd48c3070dcf6a76c69e540cc with this
description:

[quote]
commit 9982364654c186acd48c3070dcf6a76c69e540cc
Author: Tejun Heo <tj@xxxxxxxxxx>
Date:   Fri Oct 16 13:00:51 2009 +0900

libata: fix internal command failure handling
    
commit f4b31db92d163df8a639f5a8c8633bdeb6e8432d upstream.
    
When an internal command fails, it should be failed directly without invoking EH.  In the original implemetation, this was accomplished by letting internal command bypass failure handling in ata_qc_complete(). However, later changes added post-successful-completion handling to that code path and the success path is no longer adequate as internal command failure path.  One of the visible problems is that internal command failure due to timeout or other freeze conditions would spuriously trigger WARN_ON_ONCE() in the success path.

This patch updates failure path such that internal command failure handling is contained there.
[/quote]

Could be related to this bug?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/162

------------------------------------------------------------------------
On 2010-02-18T13:14:32+00:00 4280829-noduck wrote:

(In reply to comment #15)
> I ran gdb on devkit-disks-probe-ata-smart/libatasmart.

Was this information helpful? If not, can you let me know how I can
assist in fixing this bug?

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/178

------------------------------------------------------------------------
On 2010-02-24T08:19:45+00:00 4280829-noduck wrote:

(In reply to comment #17)
> Was this information helpful? If not, can you let me know how I can assist in
> fixing this bug?

For those still following: disabling smart support on the device
prevents the second (dangerous) ioctl, and as a result no more HSM
violations.

smartctl --smart off /dev/sda

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/187

------------------------------------------------------------------------
On 2010-03-26T02:20:09+00:00 Martin Pitt wrote:

I got ssh access to an affected machine and finally tracked this down. I
also compared it to the ioctls that smartmontools do.

My raw notes with all the ioctl stracing, bisecting, etc. is in
https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202
.

Summary: It seems READ_THRESHOLDS without READ_DATA (or a different
ioctl like RETURN_STATUS) causes this problem, the drive "wants" to send
more data which is never flushed. Possible explanation:
https://bugzilla.kernel.org/show_bug.cgi?id=14583#c25

http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=a223a4f6277a9f006b722b13671d5292dc6339bb
fixed this more or less inadvertetly, which explains why we don't see
that problem on our development releases.

Quite obviously from the commit, sk_disk_open() called sk_disk_smart_read_thresholds(), but not sk_disk_smart_read_data().
udisks-probe-ata-smart and skdump --can-smart just call sk_disk_open() and sk_disk_smart_is_available() (the latter does not do any I/O itself, just tests a flag).

So while a223a4 fixes this for the "common" use cases, there might still
be situations where thresholds are read, but not the values. Let's look
where init_smart() (the only place reading thresholds) is called:

 * sk_disk_smart_read_data(): OK, does READ_DATA

 * sk_disk_smart_status(): Does SK_SMART_COMMAND_RETURN_STATUS after
init_smart(), confirmed to work

 * sk_disk_smart_self_test(): OK, calls sk_disk_smart_read_data()

So right now, all code paths work.

However, a potential robust solution might be to make init_smart() call
sk_disk_smart_read_data() right after sk_disk_smart_read_thresholds().
This would cause data_is_valid to already be TRUE for self_test() (and
thus not change behaviour). sk_disk_smart_read_data() could test the
flag to avoid reading it twice. For sk_disk_smart_status() this would
mean to have an additional unused READ_DATA call, though, but that might
not hurt too much.

Reply at: https://bugs.launchpad.net/easypeasy-
project/+bug/445852/comments/221


** Changed in: libatasmart
   Importance: Unknown => Critical

-- 
devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death
https://bugs.launchpad.net/bugs/445852
You received this bug notification because you are a member of Registry
Administrators, which is the registrant for Fedora.