kernel-packages team mailing list archive

[Bug 1042369] Re: SCSI bus errors with 3TB HDDs + data corruption

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Sean Clarke <sean.clarke@xxxxxxxxxxxxxxxxxxxx>
Date: Sat, 04 Jan 2014 12:58:43 -0000
Reply-to: Bug 1042369 <1042369@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
As my last posting hinted, this was fixed in 3.6.0, so I am happy for
this bug to be closed off.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1042369

Title:
  SCSI bus errors with 3TB HDDs + data corruption

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  Hi,
      I am building a NAS unit with 6x Seagate 3TB HDDs. When using the drives I get a flood of errors in the kernel log.

  I have replaced the motherboard and even upgraded to 12.10 'Quantal
  Quetzal' and still the problem remains.


  Aug 25 19:06:58 enterprise kernel: [  595.548983] ata7.00: error: { ICRC ABRT }
  Aug 25 19:06:58 enterprise kernel: [  595.549359] ata7: hard resetting link
  Aug 25 19:06:58 enterprise kernel: [  595.629945] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
  Aug 25 19:06:58 enterprise kernel: [  595.769862] ata8.00: configured for UDMA/133
  Aug 25 19:06:58 enterprise kernel: [  595.769889] sd 7:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  Aug 25 19:06:58 enterprise kernel: [  595.769893] sd 7:0:0:0: [sdc]  Sense Key : Aborted Command [current] [descriptor]
  Aug 25 19:06:58 enterprise kernel: [  595.769898] Descriptor sense data with sense descriptors (in hex):
  Aug 25 19:06:58 enterprise kernel: [  595.769900]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 01
  Aug 25 19:06:58 enterprise kernel: [  595.769910]         5d 50 9f b0
  Aug 25 19:06:58 enterprise kernel: [  595.769914] sd 7:0:0:0: [sdc]  Add. Sense: Scsi parity error
  Aug 25 19:06:58 enterprise kernel: [  595.769918] sd 7:0:0:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00
  Aug 25 19:06:58 enterprise kernel: [  595.769930] end_request: I/O error, dev sdc, sector 5860532144
  Aug 25 19:06:58 enterprise kernel: [  595.770250] quiet_error: 502 callbacks suppressed
  Aug 25 19:06:58 enterprise kernel: [  595.770253] Buffer I/O error on device sdc, logical block 732566518
  Aug 25 19:06:58 enterprise kernel: [  595.770567] lost page write due to I/O error on sdc
  Aug 25 19:06:58 enterprise kernel: [  595.770571] Buffer I/O error on device sdc, logical block 732566519
  Aug 25 19:06:58 enterprise kernel: [  595.770874] lost page write due to I/O error on sdc
  Aug 25 19:06:58 enterprise kernel: [  595.770877] Buffer I/O error on device sdc, logical block 732566520
  Aug 25 19:06:58 enterprise kernel: [  595.771193] lost page write due to I/O error on sdc
  Aug 25 19:06:58 enterprise kernel: [  595.771196] Buffer I/O error on device sdc, logical block 732566521
  Aug 25 19:06:58 enterprise kernel: [  595.771556] lost page write due to I/O error on sdc
  Aug 25 19:06:58 enterprise kernel: [  595.771559] Buffer I/O error on device sdc, logical block 732566522
  Aug 25 19:06:58 enterprise kernel: [  595.771910] lost page write due to I/O error on sdc
  Aug 25 19:06:58 enterprise kernel: [  595.771913] Buffer I/O error on device sdc, logical block 732566523
  Aug 25 19:06:58 enterprise kernel: [  595.772260] lost page write due to I/O error on sdc

  Aug 25 19:06:58 enterprise kernel: [  595.773664] ata8: EH complete
  Aug 25 19:06:58 enterprise kernel: [  595.794893] ata8.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6
  Aug 25 19:06:58 enterprise kernel: [  595.795185] ata8.00: irq_stat 0x40000008
  Aug 25 19:06:58 enterprise kernel: [  595.795464] ata8.00: failed command: WRITE FPDMA QUEUED
  Aug 25 19:06:58 enterprise kernel: [  595.795742] ata8.00: cmd 61/00:00:00:0c:00/04:00:00:00:00/40 tag 0 ncq 524288 out
  Aug 25 19:06:58 enterprise kernel: [  595.795743]          res 41/84:00:00:0c:00/00:04:00:00:00/00 Emask 0x410 (ATA bus error) <F>
  Aug 25 19:06:58 enterprise kernel: [  595.796292] ata8.00: status: { DRDY ERR }
  Aug 25 19:06:58 enterprise kernel: [  595.796581] ata8.00: error: { ICRC ABRT }
  Aug 25 19:06:58 enterprise kernel: [  595.796894] ata8: hard resetting link
  Aug 25 19:06:58 enterprise kernel: [  595.873861] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
  Aug 25 19:06:58 enterprise kernel: [  596.017488] ata7.00: configured for UDMA/33
  Aug 25 19:06:58 enterprise kernel: [  596.017502] ata7: EH complete
  Aug 25 19:06:58 enterprise kernel: [  596.038766] ata7.00: exception Emask 0x0 SAct 0x3f SErr 0x0 action 0x6
  Aug 25 19:06:58 enterprise kernel: [  596.039055] ata7.00: irq_stat 0x40000008
  Aug 25 19:06:58 enterprise kernel: [  596.039334] ata7.00: failed command: WRITE FPDMA QUEUED
  Aug 25 19:06:58 enterprise kernel: [  596.039614] ata7.00: cmd 61/00:00:b0:97:50/04:00:5d:01:00/40 tag 0 ncq 524288 out
  Aug 25 19:06:58 enterprise kernel: [  596.039616]          res 41/84:00:b0:97:50/00:04:5d:01:00/00 Emask 0x410 (ATA bus error) <F>
  Aug 25 19:06:58 enterprise kernel: [  596.040165] ata7.00: status: { DRDY ERR }
  Aug 25 19:06:58 enterprise kernel: [  596.040459] ata7.00: error: { ICRC ABRT }
  Aug 25 19:06:58 enterprise kernel: [  596.040774] ata7: hard resetting link
  Aug 25 19:06:59 enterprise kernel: [  596.121778] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
  Aug 25 19:06:59 enterprise kernel: [  596.261840] ata8.00: configured for UDMA/133
  Aug 25 19:06:59 enterprise kernel: [  596.261866] sd 7:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  Aug 25 19:06:59 enterprise kernel: [  596.261871] sd 7:0:0:0: [sdc]  Sense Key : Aborted Command [current] [descriptor]
  Aug 25 19:06:59 enterprise kernel: [  596.261875] Descriptor sense data with sense descriptors (in hex):
  Aug 25 19:06:59 enterprise kernel: [  596.261877]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
  Aug 25 19:06:59 enterprise kernel: [  596.261887]         00 00 0c 00
  Aug 25 19:06:59 enterprise kernel: [  596.261891] sd 7:0:0:0: [sdc]  Add. Sense: Scsi parity error
  Aug 25 19:06:59 enterprise kernel: [  596.261895] sd 7:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 0c 00 00 04 00 00
  Aug 25 19:06:59 enterprise kernel: [  596.261904] end_request: I/O error, dev sdc, sector 3072
  Aug 25 19:06:59 enterprise kernel: [  596.262275] ata8: EH complete
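
  The sector numbers in the "end_request: I/O error" lines can be
  cross-checked against the CDB bytes logged just above them. The
  following is only a minimal sketch for reading the log, not part of
  the original report: the helper name cdb_lba_and_length is
  hypothetical, it handles only the two opcodes seen here (0x8a
  WRITE(16) and 0x2a WRITE(10)), and the divide-by-8 step assumes
  4096-byte buffer blocks on 512-byte logical sectors.

```python
from typing import Tuple


def cdb_lba_and_length(cdb_hex: str) -> Tuple[int, int]:
    """Return (starting LBA, transfer length in blocks) for a WRITE CDB.

    Hypothetical helper: only WRITE(16) (0x8a) and WRITE(10) (0x2a),
    the two opcodes that appear in the log, are handled.
    """
    cdb = bytes.fromhex(cdb_hex)
    opcode = cdb[0]
    if opcode == 0x8A:
        # WRITE(16): 8-byte big-endian LBA in bytes 2..9,
        # 4-byte transfer length in bytes 10..13.
        return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    if opcode == 0x2A:
        # WRITE(10): 4-byte big-endian LBA in bytes 2..5,
        # 2-byte transfer length in bytes 7..8.
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    raise ValueError("unsupported opcode 0x%02x" % opcode)


if __name__ == "__main__":
    # CDB bytes copied from the two "CDB: Write(...)" log lines.
    lba16, len16 = cdb_lba_and_length("8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00")
    lba10, len10 = cdb_lba_and_length("2a 00 00 00 0c 00 00 04 00 00")
    print(lba16, len16)
    print(lba10, len10)
    # Assumption: 4096-byte buffer blocks, i.e. 8 sectors of 512 bytes each.
    print(lba16 // 8)
```

  Under those assumptions the two CDBs decode to LBA 5860532144 and
  LBA 3072, matching the sectors reported by end_request, and
  5860532144 // 8 gives 732566518, the first "logical block" in the
  Buffer I/O error lines.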

  
  Just to recap: all 6 drives are affected at one point or another, on 2 systems (Core i7 and AMD Bulldozer 8-core) and 2 kernels (12.04 running 3.2.0-29-generic #46-Ubuntu SMP and 12.10 running 3.5.0-11-generic #11-Ubuntu SMP). I have also tried a new PSU in case that was the cause.
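
  To see how the errors spread across ports, the "error: { ICRC ABRT }"
  lines can be tallied per ATA port. This is a rough sketch rather than
  the method used for the report: the function name count_icrc_errors
  and the regular expression are illustrative assumptions based only on
  the line format shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches lines such as "... ata7.00: error: { ICRC ABRT }" and captures
# the port name ("ata7"). The pattern is an assumption derived from the
# excerpt in this report, not a general libata log grammar.
ICRC_LINE = re.compile(r"\b(ata\d+)\.\d+: error: \{[^}]*\bICRC\b[^}]*\}")


def count_icrc_errors(log_lines):
    """Count ICRC error lines per ATA port in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = ICRC_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 25 19:06:58 enterprise kernel: [  595.548983] ata7.00: error: { ICRC ABRT }",
        "Aug 25 19:06:58 enterprise kernel: [  595.549359] ata7: hard resetting link",
        "Aug 25 19:06:58 enterprise kernel: [  595.796581] ata8.00: error: { ICRC ABRT }",
        "Aug 25 19:06:58 enterprise kernel: [  596.040459] ata7.00: error: { ICRC ABRT }",
    ]
    for port, n in sorted(count_icrc_errors(sample).items()):
        print(port, n)
```

  In practice the input would be the lines of the full kernel log
  (for example an open file object) rather than the four sample lines.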

  I have experienced intermittent errors using parted (when writing to
  disk) and both ext4 and btrfs show data errors - btrfs found over 10K
  errors (and corrected them) during a scrub.

  I have run the SMART quick tests and the disks report they are all
  good.

  Here is the output from one of them:

  
  smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.0-11-generic] (local build)
  Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

  === START OF INFORMATION SECTION ===
  Model Family:     Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
  Device Model:     ST3000DM001-1CH166
  Serial Number:    S1F0MM27
  LU WWN Device Id: 5 000c50 051759d96
  Firmware Version: CC43
  User Capacity:    3,000,592,982,016 bytes [3.00 TB]
  Sector Sizes:     512 bytes logical, 4096 bytes physical
  Device is:        In smartctl database [for details use: -P show]
  ATA Version is:   8
  ATA Standard is:  ATA-8-ACS revision 4
  Local Time is:    Mon Aug 27 20:33:49 2012 BST
  SMART support is: Available - device has SMART capability.
  SMART support is: Enabled

  === START OF READ SMART DATA SECTION ===
  SMART overall-health self-assessment test result: PASSED

  General SMART Values:
  Offline data collection status:  (0x82)	Offline data collection activity
  					was completed without error.
  					Auto Offline Data Collection: Enabled.
  Self-test execution status:      (   0)	The previous self-test routine completed
  					without error or no self-test has ever 
  					been run.
  Total time to complete Offline 
  data collection: 		(  575) seconds.
  Offline data collection
  capabilities: 			 (0x7b) SMART execute Offline immediate.
  					Auto Offline data collection on/off support.
  					Suspend Offline collection upon new
  					command.
  					Offline surface scan supported.
  					Self-test supported.
  					Conveyance Self-test supported.
  					Selective Self-test supported.
  SMART capabilities:            (0x0003)	Saves SMART data before entering
  					power-saving mode.
  					Supports SMART auto save timer.
  Error logging capability:        (0x01)	Error logging supported.
  					General Purpose Logging supported.
  Short self-test routine 
  recommended polling time: 	 (   1) minutes.
  Extended self-test routine
  recommended polling time: 	 ( 333) minutes.
  Conveyance self-test routine
  recommended polling time: 	 (   2) minutes.
  SCT capabilities: 	       (0x3085)	SCT Status supported.

  SMART Attributes Data Structure revision number: 10
  Vendor Specific SMART Attributes with Thresholds:
  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate     0x000f   115   100   006    Pre-fail  Always       -       95707744
    3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
    4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
    5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
    7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       4294983044
    9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9
   10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
   12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61
  183 Runtime_Bad_Block       0x0032   095   095   000    Old_age   Always       -       5
  184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
  187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
  188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
  189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
  190 Airflow_Temperature_Cel 0x0022   072   056   045    Old_age   Always       -       28 (Min/Max 26/28)
  191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
  192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       51
  193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       65
  194 Temperature_Celsius     0x0022   028   044   000    Old_age   Always       -       28 (0 22 0 0 0)
  197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
  198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
  199 UDMA_CRC_Error_Count    0x003e   200   163   000    Old_age   Always       -       69
  240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       79070347919370
  241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       232499601
  242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       93817799

  SMART Error Log Version: 1
  No Errors Logged

  SMART Self-test log structure revision number 1
  Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
  # 1  Short offline       Completed without error       00%         6         -

  SMART Selective self-test log data structure revision number 1
   SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
      1        0        0  Not_testing
      2        0        0  Not_testing
      3        0        0  Not_testing
      4        0        0  Not_testing
      5        0        0  Not_testing
  Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
  If Selective self-test is pending on power-up, resume after 0 minute delay.

  ====================================================================
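
  Although the overall health check passes, attribute 199
  (UDMA_CRC_Error_Count) has a non-zero raw value of 69. This counter
  tracks interface CRC errors, the same class of fault as the ICRC
  errors in the kernel log. As a rough sketch, the attribute table can
  be parsed to pull out such counters on each drive. The function name
  parse_smart_attributes is hypothetical, and the parsing assumes the
  ten-column layout printed above.

```python
def parse_smart_attributes(text):
    """Parse a smartctl attribute table into a dict keyed by attribute ID.

    Assumes the ten-column layout shown in this report:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so that raw values containing
        # spaces, e.g. "28 (Min/Max 26/28)", stay in one piece.
        fields = line.split(None, 9)
        if len(fields) != 10:
            continue
        if not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue  # header line or unrelated output
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs


if __name__ == "__main__":
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   072   056   045    Old_age   Always       -       28 (Min/Max 26/28)
199 UDMA_CRC_Error_Count    0x003e   200   163   000    Old_age   Always       -       69
"""
    crc = parse_smart_attributes(sample)[199]
    print(crc["name"], crc["raw"])
```

  Running this against the output of each of the six drives would show
  whether the CRC counter is rising on all of them or only on some.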

  I can get the disks to fail almost immediately by doing a mkfs.btrfs
  across the full set - at some point one or two of them flood the
  syslog with errors.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1042369/+subscriptions