kernel-packages team mailing list archive
Message #38421
[Bug 1042369] Re: SCSI bus errors with 3TB HDDs + data corruption
As my last posting hinted at, this was fixed in 3.6.0 - happy to close
off.
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1042369
Title:
SCSI bus errors with 3TB HDDs + data corruption
Status in “linux” package in Ubuntu:
Incomplete
Bug description:
Hi,
I am building a NAS unit with 6x Seagate 3TB HDDs. When using the drives I get a flood of errors in the kernel log.
I have replaced the motherboard and even upgraded to 12.10 'Quantal
Quetzal' and still the problem remains.
Aug 25 19:06:58 enterprise kernel: [ 595.548983] ata7.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [ 595.549359] ata7: hard resetting link
Aug 25 19:06:58 enterprise kernel: [ 595.629945] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:58 enterprise kernel: [ 595.769862] ata8.00: configured for UDMA/133
Aug 25 19:06:58 enterprise kernel: [ 595.769889] sd 7:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 25 19:06:58 enterprise kernel: [ 595.769893] sd 7:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Aug 25 19:06:58 enterprise kernel: [ 595.769898] Descriptor sense data with sense descriptors (in hex):
Aug 25 19:06:58 enterprise kernel: [ 595.769900] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 01
Aug 25 19:06:58 enterprise kernel: [ 595.769910] 5d 50 9f b0
Aug 25 19:06:58 enterprise kernel: [ 595.769914] sd 7:0:0:0: [sdc] Add. Sense: Scsi parity error
Aug 25 19:06:58 enterprise kernel: [ 595.769918] sd 7:0:0:0: [sdc] CDB: Write(16): 8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00
Aug 25 19:06:58 enterprise kernel: [ 595.769930] end_request: I/O error, dev sdc, sector 5860532144
Aug 25 19:06:58 enterprise kernel: [ 595.770250] quiet_error: 502 callbacks suppressed
Aug 25 19:06:58 enterprise kernel: [ 595.770253] Buffer I/O error on device sdc, logical block 732566518
Aug 25 19:06:58 enterprise kernel: [ 595.770567] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.770571] Buffer I/O error on device sdc, logical block 732566519
Aug 25 19:06:58 enterprise kernel: [ 595.770874] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.770877] Buffer I/O error on device sdc, logical block 732566520
Aug 25 19:06:58 enterprise kernel: [ 595.771193] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.771196] Buffer I/O error on device sdc, logical block 732566521
Aug 25 19:06:58 enterprise kernel: [ 595.771556] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.771559] Buffer I/O error on device sdc, logical block 732566522
Aug 25 19:06:58 enterprise kernel: [ 595.771910] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.771913] Buffer I/O error on device sdc, logical block 732566523
Aug 25 19:06:58 enterprise kernel: [ 595.772260] lost page write due to I/O error on sdc
Aug 25 19:06:58 enterprise kernel: [ 595.773664] ata8: EH complete
Aug 25 19:06:58 enterprise kernel: [ 595.794893] ata8.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x6
Aug 25 19:06:58 enterprise kernel: [ 595.795185] ata8.00: irq_stat 0x40000008
Aug 25 19:06:58 enterprise kernel: [ 595.795464] ata8.00: failed command: WRITE FPDMA QUEUED
Aug 25 19:06:58 enterprise kernel: [ 595.795742] ata8.00: cmd 61/00:00:00:0c:00/04:00:00:00:00/40 tag 0 ncq 524288 out
Aug 25 19:06:58 enterprise kernel: [ 595.795743] res 41/84:00:00:0c:00/00:04:00:00:00/00 Emask 0x410 (ATA bus error) <F>
Aug 25 19:06:58 enterprise kernel: [ 595.796292] ata8.00: status: { DRDY ERR }
Aug 25 19:06:58 enterprise kernel: [ 595.796581] ata8.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [ 595.796894] ata8: hard resetting link
Aug 25 19:06:58 enterprise kernel: [ 595.873861] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:58 enterprise kernel: [ 596.017488] ata7.00: configured for UDMA/33
Aug 25 19:06:58 enterprise kernel: [ 596.017502] ata7: EH complete
Aug 25 19:06:58 enterprise kernel: [ 596.038766] ata7.00: exception Emask 0x0 SAct 0x3f SErr 0x0 action 0x6
Aug 25 19:06:58 enterprise kernel: [ 596.039055] ata7.00: irq_stat 0x40000008
Aug 25 19:06:58 enterprise kernel: [ 596.039334] ata7.00: failed command: WRITE FPDMA QUEUED
Aug 25 19:06:58 enterprise kernel: [ 596.039614] ata7.00: cmd 61/00:00:b0:97:50/04:00:5d:01:00/40 tag 0 ncq 524288 out
Aug 25 19:06:58 enterprise kernel: [ 596.039616] res 41/84:00:b0:97:50/00:04:5d:01:00/00 Emask 0x410 (ATA bus error) <F>
Aug 25 19:06:58 enterprise kernel: [ 596.040165] ata7.00: status: { DRDY ERR }
Aug 25 19:06:58 enterprise kernel: [ 596.040459] ata7.00: error: { ICRC ABRT }
Aug 25 19:06:58 enterprise kernel: [ 596.040774] ata7: hard resetting link
Aug 25 19:06:59 enterprise kernel: [ 596.121778] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Aug 25 19:06:59 enterprise kernel: [ 596.261840] ata8.00: configured for UDMA/133
Aug 25 19:06:59 enterprise kernel: [ 596.261866] sd 7:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 25 19:06:59 enterprise kernel: [ 596.261871] sd 7:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Aug 25 19:06:59 enterprise kernel: [ 596.261875] Descriptor sense data with sense descriptors (in hex):
Aug 25 19:06:59 enterprise kernel: [ 596.261877] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 25 19:06:59 enterprise kernel: [ 596.261887] 00 00 0c 00
Aug 25 19:06:59 enterprise kernel: [ 596.261891] sd 7:0:0:0: [sdc] Add. Sense: Scsi parity error
Aug 25 19:06:59 enterprise kernel: [ 596.261895] sd 7:0:0:0: [sdc] CDB: Write(10): 2a 00 00 00 0c 00 00 04 00 00
Aug 25 19:06:59 enterprise kernel: [ 596.261904] end_request: I/O error, dev sdc, sector 3072
Aug 25 19:06:59 enterprise kernel: [ 596.262275] ata8: EH complete
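The sector numbers reported by end_request in the log above can be cross-checked against the CDB bytes printed just before them. The following is a minimal sketch, not part of the original report: the helper name decode_write_cdb is hypothetical, and the 4096-byte buffer block size is an assumption (the log does not state it, although it is consistent with the reported logical block numbers).

```python
def decode_write_cdb(cdb_hex):
    """Return (lba, block_count) for a SCSI Write(10) or Write(16) CDB.

    cdb_hex is the space-separated hex string printed after "CDB:" in the log.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    opcode = cdb[0]
    if opcode == 0x8A:
        # Write(16): 8-byte big-endian LBA in bytes 2-9, 4-byte length in bytes 10-13.
        return int.from_bytes(cdb[2:10], "big"), int.from_bytes(cdb[10:14], "big")
    if opcode == 0x2A:
        # Write(10): 4-byte big-endian LBA in bytes 2-5, 2-byte length in bytes 7-8.
        return int.from_bytes(cdb[2:6], "big"), int.from_bytes(cdb[7:9], "big")
    raise ValueError(f"unsupported opcode 0x{opcode:02x}")


# CDBs copied from the log excerpt.
lba16, count16 = decode_write_cdb("8a 00 00 00 00 01 5d 50 9f b0 00 00 03 a8 00 00")
lba10, count10 = decode_write_cdb("2a 00 00 00 0c 00 00 04 00 00")

# Assumption: 512-byte logical sectors (per the smartctl output) and
# 4096-byte buffer blocks, so one buffer block covers 8 sectors.
first_buffer_block = lba16 * 512 // 4096

print(lba16, count16, first_buffer_block)
print(lba10, count10)
```

The decoded LBAs come out as 5860532144 and 3072, matching the two "I/O error, dev sdc, sector ..." lines, and the first LBA divided by 8 gives 732566518, the first logical block named in the Buffer I/O errors.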
Just to recap - all 6 drives are affected at one point or another, across 2 systems (Core i7 and AMD Bulldozer 8 core) and 2 kernels (12.04 running 3.2.0-29-generic #46-Ubuntu SMP and 12.10 running 3.5.0-11-generic #11-Ubuntu SMP). I have also tried a new PSU in case that was the cause.
I have experienced intermittent errors using parted (when writing to
disk) and both ext4 and btrfs show data errors - btrfs found over 10K
errors (and corrected them) during a scrub.
I have run the SMART quick tests and the disks report they are all
good.
Here is the output from one of them:
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.5.0-11-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda (SATA 3Gb/s, 4K Sectors)
Device Model: ST3000DM001-1CH166
Serial Number: S1F0MM27
LU WWN Device Id: 5 000c50 051759d96
Firmware Version: CC43
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Aug 27 20:33:49 2012 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 575) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 333) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 100 006 Pre-fail Always - 95707744
3 Spin_Up_Time 0x0003 091 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 60
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 253 030 Pre-fail Always - 4294983044
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 9
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 61
183 Runtime_Bad_Block 0x0032 095 095 000 Old_age Always - 5
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 072 056 045 Old_age Always - 28 (Min/Max 26/28)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 51
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 65
194 Temperature_Celsius 0x0022 028 044 000 Old_age Always - 28 (0 22 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 163 000 Old_age Always - 69
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 79070347919370
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 232499601
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 93817799
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 6 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
====================================================================
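Although the overall SMART assessment is PASSED, individual attributes in the table above are worth reading on their own. Attribute 199 (UDMA_CRC_Error_Count) has a raw value of 69, which is consistent with the ICRC errors in the kernel log, while the reallocated and pending sector counts are zero. Here is a rough sketch for pulling those raw values out of saved smartctl output. The function name parse_smart_attributes is hypothetical, and the sample rows are copied from the table above with whitespace collapsed.

```python
def parse_smart_attributes(text):
    """Map attribute ID -> (name, raw value) from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns, a numeric ID and a hex FLAG column.
        if len(fields) != 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        try:
            # Keep only the leading number, dropping trailers like "(Min/Max 26/28)".
            raw = int(fields[9].split()[0])
        except ValueError:
            continue  # some drives report non-numeric raw values; skip those
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0022 072 056 045 Old_age Always - 28 (Min/Max 26/28)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 163 000 Old_age Always - 69
"""

attrs = parse_smart_attributes(SAMPLE)
for attr_id in (5, 197, 199):
    name, raw = attrs[attr_id]
    print(f"{attr_id:>3} {name}: {raw}")
```

A nonzero CRC count alongside zero reallocated and pending sectors points at the link between drive and host rather than at the media, though this alone does not say which side of the link is at fault.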
I can get the disks to fail almost immediately by doing a mkfs.btrfs
across the full set - at some point one or two of them flood the
syslog with errors.
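To see which ports are flooding the log during such a run, the ICRC error lines can be tallied per ATA device. This is only a sketch under stated assumptions: count_icrc_errors is a hypothetical helper, it expects lines in the same format as the excerpt in this report, and the sample lines below are copied from that excerpt.

```python
import re
from collections import Counter

# Matches e.g. "ata7.00: error: { ICRC ABRT }" and captures the device and the flags.
ERROR_RE = re.compile(r"(ata\d+\.\d+): error: \{ ([A-Z ]+?) \}")


def count_icrc_errors(lines):
    """Count libata error lines carrying the ICRC flag, keyed by ATA device."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match and "ICRC" in match.group(2).split():
            counts[match.group(1)] += 1
    return counts


SAMPLE = [
    "Aug 25 19:06:58 enterprise kernel: [ 595.548983] ata7.00: error: { ICRC ABRT }",
    "Aug 25 19:06:58 enterprise kernel: [ 595.796292] ata8.00: status: { DRDY ERR }",
    "Aug 25 19:06:58 enterprise kernel: [ 595.796581] ata8.00: error: { ICRC ABRT }",
    "Aug 25 19:06:58 enterprise kernel: [ 596.040459] ata7.00: error: { ICRC ABRT }",
]

counts = count_icrc_errors(SAMPLE)
for device, n in sorted(counts.items()):
    print(device, n)
```

The same function can be fed an open log file object instead of the sample list, since it only iterates over lines.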
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1042369/+subscriptions