kernel-packages team mailing list archive

[Bug 539467] Re: SATA link power management causes disk errors and corruption

 

This issue appears to be present again on kernels 3.13 (all), 3.16 (all)
and 3.17 (all).

Upon shifting the SATA link power policy from the min_power state to the
max_performance state, all of these kernels report various forms of this
error:

[   45.200582] ata3.00: exception Emask 0x10 SAct 0x8000 SErr 0x50000 action 0xe frozen
[   45.200586] ata3.00: irq_stat 0x00400000, PHY RDY changed
[   45.200589] ata3: SError: { PHYRdyChg CommWake }
[   45.200592] ata3.00: failed command: WRITE FPDMA QUEUED
[   45.200596] ata3.00: cmd 61/e8:78:00:3f:48/00:00:04:00:00/40 tag 15 ncq 118784 out
[   45.200596]          res 40/00:7c:00:3f:48/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
[   45.200597] ata3.00: status: { DRDY }
[   45.200601] ata3: hard resetting link
[   45.925051] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   45.925911] ata3.00: configured for UDMA/133
[   45.941016] ahci 0000:00:1f.2: port does not support device sleep
[   45.941029] ata3: EH complete
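
For anyone reading the log above, the SErr value can be decoded by hand. The sketch below is only an illustration: the bit positions follow the SATA SError register layout and the short names are the ones the kernel appears to print, both listed from memory, so treat the table as an assumption rather than a copy of the libata source. decode_serr is a hypothetical helper name.

```python
# Bit positions follow the SATA SError register layout. The short names
# mirror what the kernel prints in "SError: { ... }" lines; they are listed
# from memory here, so treat this table as an assumption.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value taken from the ata3 exception line in the log above.
    print(decode_serr(0x50000))
```

With the value from the log, 0x50000 has bits 16 and 18 set, which lines up with the "PHYRdyChg CommWake" text the kernel printed.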

The current 3.13 kernel reports the most severe errors, such as block
write failures.

The machine this is being tested on is a Dell XPS13 (9333) with BIOS A05.

[    2.288104] ata3.00: ATA-8: LITEONIT LMT-256L9M-11 MSATA 256GB, HM8110B, max UDMA/133
[    2.288554] scsi 2:0:0:0: Direct-Access     ATA      LITEONIT LMT-256 10B  PQ: 0 ANSI: 5

As this machine is brand new, it's possible that the hardware is actually
failing; however, SMART doesn't indicate any problems with the block
device:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.17.0-031700-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEONIT LMT-256L9M-11 MSATA 256GB
Serial Number:    TW0N42H75508548P1854
Firmware Version: HM8110B
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 10 13:39:25 2014 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (   10) seconds.
Offline data collection
capabilities:                    (0x15) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       46
175 Program_Fail_Count_Chip 0x0003   100   100   000    Pre-fail  Always       -       0
176 Erase_Fail_Count_Chip   0x0003   100   100   000    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       1946
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       0
179 Used_Rsvd_Blk_Cnt_Tot   0x0003   100   100   000    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       1216
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x0003   100   100   000    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       8704
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       1385

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -
# 2  Short offline       Completed without error       00%         0         -

Selective Self-tests/Logging not supported
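
The attribute table above can also be checked mechanically instead of by eye. This is a rough sketch, assuming the ten-column layout shown in this report; the regular expression, the helper names and the ERROR_COUNTERS watch list are my own choices for illustration, not anything smartctl defines.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value. Assumes the column layout shown in this report.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)

# Hypothetical watch list: counters where a non-zero raw value looks worrying.
ERROR_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Erase_Fail_Count_Total",
    "Reported_Uncorrect",
)

# A few rows copied from the smartctl output in this report.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       1946
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       0
"""


def parse_attributes(text):
    """Map attribute name to its id, normalized value, worst, thresh and raw."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_ROW.match(line)
        if m:
            attr_id, name, value, worst, thresh, raw = m.groups()
            attrs[name] = {
                "id": int(attr_id),
                "value": int(value),
                "worst": int(worst),
                "thresh": int(thresh),
                "raw": int(raw),
            }
    return attrs


def nonzero_error_counters(attrs):
    """Return the watched counters whose raw value is not zero."""
    return {
        name: attrs[name]["raw"]
        for name in ERROR_COUNTERS
        if name in attrs and attrs[name]["raw"] != 0
    }


if __name__ == "__main__":
    print(nonzero_error_counters(parse_attributes(SAMPLE)))
```

On the rows copied from this report, none of the watched counters is non-zero, which agrees with the reading that SMART shows no problem.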

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/539467

Title:
  SATA link power management causes disk errors and corruption

Status in The Linux Kernel:
  Expired
Status in “linux” package in Ubuntu:
  Invalid
Status in “pm-utils” package in Ubuntu:
  Fix Released
Status in “pm-utils-powersave-policy” package in Ubuntu:
  Invalid
Status in “linux” source package in Lucid:
  Won't Fix
Status in “pm-utils” source package in Lucid:
  Invalid
Status in “pm-utils-powersave-policy” source package in Lucid:
  Fix Released
Status in “linux” source package in Maverick:
  Invalid
Status in “pm-utils” source package in Maverick:
  Invalid
Status in “pm-utils-powersave-policy” source package in Maverick:
  Invalid
Status in “linux” source package in Natty:
  Invalid
Status in “pm-utils” source package in Natty:
  Fix Released
Status in “pm-utils-powersave-policy” source package in Natty:
  Invalid

Bug description:
  SRU Justification for pm-utils-powersave-policy:

  Impact: On certain hardware, enabling power saving for the SATA link
  can cause data corruption.

  How Addressed: The proposed branch removes the SATA link power policy
  script. The link will then stay at normal power usage instead of
  dropping to a lower-power state when the machine is unplugged from AC
  power.

  Reproduction: On an affected machine, unplug and plug in the power a
  few times. Data corruption will result.

  Regression Potential: Removing the script will cause the SATA link to
  stay fully powered at all times. This may cause an increase in the
  battery usage for some machines. There should be no functionality
  regressions or bugs introduced by this change.
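
To see which policy each SATA host is currently using, something like the sketch below may help. It assumes the policy is exposed as /sys/class/scsi_host/hostN/link_power_management_policy, which is where I recall the AHCI driver putting it; the path, the helper names and the idea of treating anything other than max_performance as "saving power" are assumptions for illustration.

```python
import os

# Assumed sysfs location of the per-host policy file.
SYSFS_ROOT = "/sys/class/scsi_host"
POLICY_FILE = "link_power_management_policy"


def read_policies(root=SYSFS_ROOT):
    """Return {host: policy} for every host under root that has a policy file."""
    policies = {}
    if not os.path.isdir(root):
        return policies
    for host in sorted(os.listdir(root)):
        path = os.path.join(root, host, POLICY_FILE)
        try:
            with open(path) as f:
                policies[host] = f.read().strip()
        except OSError:
            # Host without link power management support, or unreadable file.
            continue
    return policies


def hosts_saving_power(policies):
    """List hosts whose policy is anything other than max_performance."""
    return [host for host, policy in policies.items() if policy != "max_performance"]


if __name__ == "__main__":
    current = read_policies()
    print(current)
    print(hosts_saving_power(current))
```

The root directory is a parameter so the functions can be pointed at a scratch directory for testing; on a machine without the sysfs files the result is simply empty.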

  =====

  Using Lucid on my laptop, I see errors like this in dmesg quite
  frequently (every few hours):

  Mar 14 23:00:09 chris-laptop kernel: [42987.460608] ata1.00: exception Emask 0x10 SAct 0x1 SErr 0x50000 action 0xe frozen
  Mar 14 23:00:09 chris-laptop kernel: [42987.460618] ata1.00: irq_stat 0x00400000, PHY RDY changed
  Mar 14 23:00:09 chris-laptop kernel: [42987.460627] ata1: SError: { PHYRdyChg CommWake }
  Mar 14 23:00:09 chris-laptop kernel: [42987.460635] ata1.00: failed command: READ FPDMA QUEUED
  Mar 14 23:00:09 chris-laptop kernel: [42987.460649] ata1.00: cmd 60/08:00:97:23:44/00:00:01:00:00/40 tag 0 ncq 4096 in
  Mar 14 23:00:09 chris-laptop kernel: [42987.460652]          res 40/00:04:97:23:44/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
  Mar 14 23:00:09 chris-laptop kernel: [42987.460669] ata1.00: status: { DRDY }
  Mar 14 23:00:09 chris-laptop kernel: [42987.460681] ata1: hard resetting link
  Mar 14 23:00:09 chris-laptop kernel: [42987.523336] ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0xe frozen
  Mar 14 23:00:09 chris-laptop kernel: [42987.523346] ata2: irq_stat 0x00400000, PHY RDY changed
  Mar 14 23:00:09 chris-laptop kernel: [42987.523355] ata2: SError: { PHYRdyChg CommWake }
  Mar 14 23:00:09 chris-laptop kernel: [42987.523368] ata2: hard resetting link
  Mar 14 23:00:09 chris-laptop kernel: [42988.202586] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  Mar 14 23:00:09 chris-laptop kernel: [42988.205443] ata1.00: configured for UDMA/133
  Mar 14 23:00:09 chris-laptop kernel: [42988.205459] ata1: EH complete
  Mar 14 23:00:09 chris-laptop kernel: [42988.280089] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
  Mar 14 23:00:09 chris-laptop kernel: [42988.285567] ata2.00: configured for UDMA/100
  Mar 14 23:00:09 chris-laptop kernel: [42988.289370] ata2: EH complete

  Every couple of days, this results in data corruption and my
  filesystem being remounted read-only:

  [ 6148.305806] Aborting journal on device sda1-8.
  [ 6148.325011] EXT4-fs error (device sda1): ext4_journal_start_sb: Detected aborted journal
  [ 6148.325018] EXT4-fs (sda1): Remounting filesystem read-only
  [ 6148.326702] journal commit I/O error
  [ 6148.330975] EXT4-fs error (device sda1) in ext4_reserve_inode_write: Journal has aborted
  [ 6148.462572] __ratelimit: 15 callbacks suppressed

  Those messages generally appear at the end of dmesg after the event,
  just after the "hard resetting link" message. I then have to boot a
  live CD and manually run fsck, as I can no longer boot the laptop.
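
A quick way to see how often the resets happen, and whether a journal abort shows up in the same log, is to scan the kernel log text. This is a minimal sketch that only matches the two message forms quoted in this report; the regular expressions and the summarize helper are hypothetical and would need adjusting for other message variants.

```python
import re
from collections import Counter

# Patterns for the two message forms quoted in this report.
RESET = re.compile(r"\b(ata\d+): hard resetting link")
JOURNAL_ABORT = re.compile(r"Aborting journal on device (\S+?)\.?$")

# Lines copied from the logs in this report.
SAMPLE = """\
Mar 14 23:00:09 chris-laptop kernel: [42987.460681] ata1: hard resetting link
Mar 14 23:00:09 chris-laptop kernel: [42987.523368] ata2: hard resetting link
Mar 14 23:00:09 chris-laptop kernel: [42988.205459] ata1: EH complete
[ 6148.305806] Aborting journal on device sda1-8.
[ 6148.325018] EXT4-fs (sda1): Remounting filesystem read-only
"""


def summarize(log_text):
    """Count hard link resets per port and collect aborted journal devices."""
    resets = Counter()
    aborted = []
    for line in log_text.splitlines():
        line = line.rstrip()
        m = RESET.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = JOURNAL_ABORT.search(line)
        if m:
            aborted.append(m.group(1))
    return resets, aborted


if __name__ == "__main__":
    resets, aborted = summarize(SAMPLE)
    print(dict(resets), aborted)
```

Feeding it the saved output of dmesg or the kernel log gives a per-port reset count that can be compared between kernels.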

  This is happening every couple of days generally, although it happened
  3 times in one day last Thursday.

  I did contemplate it being a hardware issue, but I tried running the
  kernel from Karmic for a couple of days, and that worked OK without a
  single error message.

  ProblemType: Bug
  AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  chr1s      4010 F.... pulseaudio
   /dev/snd/controlC1:  chr1s      4010 F.... pulseaudio
  CRDA: Error: [Errno 2] No such file or directory
  Card0.Amixer.info:
   Card hw:0 'Intel'/'HDA Intel at 0xf6afc000 irq 21'
     Mixer name	: 'Intel G45 DEVCTG'
     Components	: 'HDA:111d76b2,10280263,00100302 HDA:80862802,80860101,00100000'
     Controls      : 22
     Simple ctrls  : 11
  Card1.Amixer.info:
   Card hw:1 'U0x46d0x9a4'/'USB Device 0x46d:0x9a4 at usb-0000:00:1a.7-3.3, high speed'
     Mixer name	: 'USB Mixer'
     Components	: 'USB046d:09a4'
     Controls      : 2
     Simple ctrls  : 1
  Card1.Amixer.values:
   Simple mixer control 'Mic',0
     Capabilities: cvolume cvolume-joined cswitch cswitch-joined penum
     Capture channels: Mono
     Limits: Capture 0 - 14
     Mono: Capture 0 [0%] [23.75dB] [on]
  Date: Tue Mar 16 10:07:41 2010
  DistroRelease: Ubuntu 10.04
  Frequency: Once a day.
  HibernationDevice: RESUME=UUID=762f3439-67ac-4828-aa94-caf2a2ba0f9a
  InstallationMedia: Ubuntu 9.10 "Karmic Koala" - Release amd64 (20091027)
  LiveMediaBuild: Ubuntu 9.10 "Karmic Koala" - Release amd64 (20091027)
  MachineType: Dell Inc. Latitude E5500
  Package: linux-image-2.6.32-16-generic 2.6.32-16.25
  PccardctlIdent:
   Socket 0:
     no product info available
  PccardctlStatus:
   Socket 0:
     no card
  ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-16-generic root=UUID=4ce5e12b-6e82-4fa4-90ff-7d9859d7504e ro quiet splash
  ProcEnviron:
   LANG=en_GB.utf8
   SHELL=/bin/bash
  ProcVersionSignature: Ubuntu 2.6.32-16.25-generic
  Regression: Yes
  RelatedPackageVersions: linux-firmware 1.32
  Reproducible: No
  SourcePackage: linux
  TestedUpstream: No
  Uname: Linux 2.6.32-16-generic x86_64
  dmi.bios.date: 11/05/2009
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: A15
  dmi.board.name: 0DW635
  dmi.board.vendor: Dell Inc.
  dmi.chassis.type: 8
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvrA15:bd11/05/2009:svnDellInc.:pnLatitudeE5500:pvr:rvnDellInc.:rn0DW635:rvr:cvnDellInc.:ct8:cvr:
  dmi.product.name: Latitude E5500
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/539467/+subscriptions