kernel-packages team mailing list archive

[Bug 1429959] fix eeh on qla2xxx

 

------- Comment on attachment From thadeul@xxxxxxxxxx 2015-04-16 18:57 EDT-------


Mauricio and I have investigated this and found that there is code that checks for all FFs and might set the HBA offline and release its queues, which, during our tests, could even crash the system.

The attached patch seems to fix all the problems we have seen, including
the case where the adapter seems to have recovered but can't do any I/O.

We have tested this on top of 4.0.

Mauricio, can you build a package with this patch on top of Vivid's
kernel, so this can be tested?

Thanks.
Cascardo.

** Attachment added: "fix eeh on qla2xxx"
   https://bugs.launchpad.net/bugs/1429959/+attachment/4397679/+files/qla2xxx_eeh.patch

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1429959

Title:
  Auto Error Recovery is failing after error injected for sailfish card
  in Ubuntu 14.10 [PowerNV]

Status in linux package in Ubuntu:
  New

Bug description:
  ---Problem Description---

  PowerNV/Ubuntu 14.10: Auto Error Recovery fails after an error is injected for the sailfish card
   
  ---uname output---
  Linux powerio-le21 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:55:08 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = 8286-42A 
    
  ---Steps to Reproduce---
   
  There are 2 LUNs coming across 3 different paths and multipath is configured.

  1. Run I/O activity by running HTX load on the multipath devices.
  2. Verify I/O activity on the multipath devices with the iostat command.
  3. Inject an error with the following commands:

      echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_inboundA;
      sleep 1;
      echo 0x0 > /sys/kernel/debug/powerpc/PCI0001/err_injct_inboundA

  4. The error injection happened and the I/O activity was suspended, as confirmed by iostat.
  5. Error recovery of the PCI devices did not happen and the devices remained inaccessible.
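
The injection step above can be wrapped in a small script so that it is harder to freeze the wrong PHB by accident. This is only a sketch: the debugfs path and the two values are copied from the report, while `ERR_FILE`, `DRY_RUN`, `inject_eeh` and `run_injection` are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch of the error injection step from the report.
# ERR_FILE and DRY_RUN are illustrative knobs, not part of any existing tool.
ERR_FILE="${ERR_FILE:-/sys/kernel/debug/powerpc/PCI0001/err_injct_inboundA}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be written

inject_eeh() {
    # $1: value to write to the injection file
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write $1 to $ERR_FILE"
    else
        echo "$1" > "$ERR_FILE"
    fi
}

run_injection() {
    # Same sequence as in the report: write the mask, wait 1 second, write 0x0.
    inject_eeh 0x8000000000000000
    sleep 1
    inject_eeh 0x0
}
```

Calling `run_injection` with `DRY_RUN=0` as root would perform the real writes; with the default it only prints them, which is useful for checking the path first.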

  The dmesg output during the event is as follows:

  [  376.148715] systemd-logind[7123]: New session 6 of user root.
  [  497.572751] EEH: Frozen PHB#1-PE#8 detected
  [  497.572799] EEH: PE location: U78C9.001.WZS006T-P1-C12     , PHB location: U78C9.001.WZS006T-P1-C32
  [  497.572890] CPU: 32 PID: 0 Comm: swapper/32 Tainted: G           OE 3.16.0-23-generic #31-Ubuntu
  [  497.572892] Call Trace:
  [  497.572898] [c000003fffe97b90] [c000000000017390] show_stack+0x170/0x290 (unreliable)
  [  497.572902] [c000003fffe97c70] [c000000000a05fc0] dump_stack+0x90/0xbc
  [  497.572906] [c000003fffe97ca0] [c000000000038010] eeh_dev_check_failure+0x560/0x580
  [  497.572908] [c000003fffe97d40] [c0000000000380b8] eeh_check_failure+0x88/0xe0
  [  497.572933] [c000003fffe97d80] [d00000001cb247a8] qla24xx_msix_rsp_q+0x108/0x200 [qla2xxx]
  [  497.572936] [c000003fffe97e10] [c0000000001319b0] handle_irq_event_percpu+0x90/0x2b0
  [  497.572938] [c000003fffe97ed0] [c000000000131c38] handle_irq_event+0x68/0xd0
  [  497.572940] [c000003fffe97f00] [c000000000136f80] handle_fasteoi_irq+0xe0/0x2a0
  [  497.572942] [c000003fffe97f30] [c000000000130ca8] generic_handle_irq+0x58/0x90
  [  497.572943] [c000003fffe97f60] [c0000000000119c0] __do_irq+0x80/0x190
  [  497.572945] [c000003fffe97f90] [c0000000000253d0] call_do_irq+0x14/0x24
  [  497.572946] [c000002fe83abab0] [c000000000011b68] do_IRQ+0x98/0x140
  [  497.572948] [c000002fe83abb00] [c000000000002794] hardware_interrupt_common+0x114/0x180
  [  497.572952] --- Exception: 501 at snooze_loop+0xd8/0x170
      LR = snooze_loop+0x90/0x170
  [  497.572955] [c000002fe83abdf0] [c000000000a33680] cpu_online_mask+0x0/0x8 (unreliable)
  [  497.572957] [c000002fe83abe30] [c0000000008405bc] cpuidle_enter_state+0x6c/0x140
  [  497.572960] [c000002fe83abe80] [c000000000113938] cpu_startup_entry+0x318/0x4c0
  [  497.572962] [c000002fe83abf20] [c000000000043844] start_secondary+0x324/0x350
  [  497.572964] [c000002fe83abf90] [c000000000009a6c] start_secondary_prolog+0x10/0x14
  [  497.572973] EEH: Detected PCI bus error on PHB#1-PE#8
  [  497.572978] EEH: This PCI device has failed 1 times in the last hour
  [  497.572979] EEH: Notify device drivers to shutdown
  [  497.573000] qla2xxx [0001:07:00.0]-015b:2: Disabling adapter.
  [  497.573071] sd 2:0:1:1: [sdd] Unhandled error code
  [  497.573072] sd 2:0:1:1: [sdd] Unhandled error code
  [  497.573075] sd 2:0:1:0: [sdc] Unhandled error code
  [  497.573076] sd 2:0:1:1: [sdd] Unhandled error code
  [  497.573077] sd 2:0:1:1: [sdd] Unhandled error code
  [  497.573078] sd 2:0:1:1: [sdd]  
  [  497.573079] sd 2:0:1:0: [sdc] Unhandled error code
  [  497.573080] sd 2:0:1:0: [sdc] Unhandled error code
  [  497.573081] sd 2:0:1:1: [sdd]  
  [  497.573082] sd 2:0:1:1: [sdd] Unhandled error code
  [  497.573084] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573085] sd 2:0:1:0: [sdc] Unhandled error code
  [  497.573086] sd 2:0:1:1: [sdd] CDB: 
  [  497.573087] sd 2:0:1:1: [sdd]  
  [  497.573088] sd 2:0:1:0: [sdc]  
  [  497.573088] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573089] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573090] sd 2:0:1:1: [sdd] CDB: 
  [  497.573091] sd 2:0:1:1: [sdd]  
  [  497.573095] Read(10)
  [  497.573095] sd 2:0:1:0: [sdc]  
  [  497.573096] sd 2:0:1:0: [sdc]  
  [  497.573097] :
  [  497.573097] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573099] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573100] Read(10)
  [  497.573100] sd 2:0:1:1: [sdd] CDB: 
  [  497.573101] sd 2:0:1:0: [sdc]  
  [  497.573103] :
  [  497.573103]  28
  [  497.573104] Read(10)
  [  497.573104] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573106]  28 00
  [  497.573107] :
  [  497.573108]  00
  [  497.573108] sd 2:0:1:1: [sdd] CDB: 
  [  497.573108]  00
  [  497.573109] Read(10)
  [  497.573109]  28
  [  497.573110]  31
  [  497.573111] :
  [  497.573111] sd 2:0:1:0: [sdc] CDB: 
  [  497.573113]  03
  [  497.573115]  28 00
  [  497.573124]  fe
  [  497.573124] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573125] sd 2:0:1:0: [sdc] Unhandled error code
  [  497.573126] sd 2:0:1:0: [sdc] CDB: 
  [  497.573127] sd 2:0:1:0: [sdc] CDB: 
  [  497.573128] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573129] sd 2:0:1:0: [sdc]  
  [  497.573130]  00
  [  497.573131]  00
  [  497.573132]  00
  [  497.573133]  78
  [  497.573133]  90
  [  497.573134]  00
  [  497.573134]  00
  [  497.573135]  c9
  [  497.573135] sd 2:0:1:0: [sdc] CDB: 
  [  497.573137] Read(10)
  [  497.573137] Read(10)
  [  497.573138] Read(10)
  [  497.573138] :
  [  497.573139] :
  [  497.573140] :
  [  497.573141]  28
  [  497.573141] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573142]  28
  [  497.573143] Read(10)
  [  497.573144]  28
  [  497.573144]  00
  [  497.573145]  00
  [  497.573146] :
  [  497.573147]  00
  [  497.573147]  00
  [  497.573149]  6c de
  [  497.573149]  00
  [  497.573150]  00
  [  497.573151]  16
  [  497.573152]  50
  [  497.573153]  58
  [  497.573153]  00
  [  497.573154]  00
  [  497.573154]  00
  [  497.573161]  00
  [  497.573163]  00
  [  497.573164]  02
  [  497.573164]  20
  [  497.573165]  80
  [  497.573166]  00
  [  497.573166]  00
  [  497.573167]  b7
  [  497.573167]  00
  [  497.573168]  f0

  [  497.573171]  00
  [  497.573171] end_request: I/O error, dev sdd, sector 3212286
  [  497.573172]  05
  [  497.573173]  00

  [  497.573178]  00
  [  497.573179] end_request: I/O error, dev sdd, sector 7915856

  [  497.573186]  bf 9d
  [  497.573188]  07
  [  497.573189]  d7
  [  497.573190]  00
  [  497.573190] device-mapper: multipath: Failing path 8:48.
  [  497.573192] sd 2:0:1:0: [sdc] CDB: 
  [  497.573195]  12
  [  497.573197]  00
  [  497.573197]  00
  [  497.573198]  fb
  [  497.573199]  28
  [  497.573200]  01
  [  497.573201]  00
  [  497.573201] end_request: I/O error, dev sdd, sector 9437272
  [  497.573211]  40
  [  497.573212]  00
  [  497.573212]  00
  [  497.573213]  00
  [  497.573214]  00
  [  497.573214]  00
  [  497.573217]  00
  [  497.573230]  01
  [  497.573231]  40
  [  497.573232] end_request: I/O error, dev sdd, sector 2144240

  [  497.573241]  00
  [  497.573241] EEH: Collect temporary log
  [  497.573243] Read(10)
  [  497.573244] end_request: I/O error, dev sdc, sector 1490845
  [  497.573247]  a8
  [  497.573249] :
  [  497.573250]  00
  [  497.573250]  3d

  [  497.573251]  de
  [  497.573251] device-mapper: multipath: Failing path 8:32.

  [  497.573258]  00 00
  [  497.573258] end_request: I/O error, dev sdc, sector 14095099
  [  497.573262]  28
  [  497.573262] end_request: I/O error, dev sdc, sector 7134727
  [  497.573266]  02 00
  [  497.573269]  00
  [  497.573269] end_request: I/O error, dev sdc, sector 11025886
  [  497.573296]  00 40 0c a3 00 00 80 00
  [  497.573298] end_request: I/O error, dev sdc, sector 4197539
  [  497.573430] EEH: of node=/pciex@3fffe40100000/pci@0/pci@0/pci@9/pci@0/pci@2/fibre-channel@0
  [  497.573525] EEH: PCI device/vendor: 25321077
  [  497.573557] EEH: PCI cmd/status register: 00100142
  [  497.573581] EEH: PCI-E capabilities and status follow:
  [  497.573601] EEH: PCI-E 00: 00028810 10008103 0009585e 0000d482 
  [  497.573611] EEH: PCI-E 10: 10420040 00000000 00000000 00000000 
  [  497.573612] EEH: PCI-E 20: 00000000 
  [  497.573612] EEH: PCI-E AER capability register set follows:
  [  497.573621] EEH: PCI-E AER 00: 13810001 00000000 00000000 00062030 
  [  497.573629] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000 
  [  497.573636] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
  [  497.573638] EEH: PCI-E AER 30: 00000000 00000000 
  [  497.573639] EEH: of node=/pciex@3fffe40100000/pci@0/pci@0/pci@9/pci@0/pci@2/fibre-channel@0,1
  [  497.573641] EEH: PCI device/vendor: 25321077
  [  497.573643] EEH: PCI cmd/status register: 00100142
  [  497.573643] EEH: PCI-E capabilities and status follow:
  [  497.573652] EEH: PCI-E 00: 00028810 10008103 0009585e 0000d482 
  [  497.573665] EEH: PCI-E 10: 10420040 00000000 00000000 00000000 
  [  497.573666] EEH: PCI-E 20: 00000000 
  [  497.573666] EEH: PCI-E AER capability register set follows:
  [  497.573675] EEH: PCI-E AER 00: 13810001 00000000 00000000 00062030 
  [  497.573682] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000 
  [  497.573689] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
  [  497.573691] EEH: PCI-E AER 30: 00000000 00000000 
  [  497.573693] PHB3 PHB#1 Diag-data (Version: 1)
  [  497.573694] brdgCtl:     00000002
  [  497.573695] RootSts:     00000040 00400000 f0830008 00100147 00002000
  [  497.573695] PhbSts:      0000001c00000000 0000001c00000000
  [  497.573696] Lem:         0000000000100000 42498e327f502eae 0000000000000000
  [  497.573697] InAErr:      8000000000000000 8000000000000000 0402040000000000 0000000000000000
  [  497.573698] PE[  8] A/B: 8480002b00000000 8000000000000000
  [  497.573699] EEH: Reset without hotplug activity
  [  497.573967] sd 2:0:1:1: [sdd]  
  [  497.573968] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  497.573970] sd 2:0:1:1: [sdd] CDB: 
  [  497.573971] Read(10): 28 00 00 1a 15 86 00 00 0c 00
  [  497.573978] end_request: I/O error, dev sdd, sector 1709446
  [  497.577069] sd 2:0:0:0: [sda] Synchronizing SCSI cache
  [  499.630649] EEH: Notify device drivers the completion of reset
  [  499.631187] qla2xxx [0001:07:00.1]-00af:5: Performing ISP error recovery - ha=c00000001320c000.
  [  502.258629] qla2xxx 0001:07:00.1: Direct firmware load failed with error -2
  [  502.258630] qla2xxx 0001:07:00.1: Falling back to user helper
  [  502.259236] qla2xxx [0001:07:00.1]-0063:5: Failed to load firmware image (ql2500_fw.bin).
  [  502.259354] qla2xxx [0001:07:00.1]-0090:5: Fimware image unavailable.
  [  502.259410] qla2xxx [0001:07:00.1]-0091:5: Firmware images can be retrieved from: http://ldriver.qlogic.com/firmware/.
  [  504.322682] qla2xxx [0001:07:00.1]-505f:5: Link is operational (4 Gbps).
  [  504.370577] EEH: Notify device driver to resume
  [  504.370580] qla2xxx [0001:07:00.0]-9002:2: The device failed to resume I/O from slot/link_reset.
  [  504.378577] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378581] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378586] sd 2:0:0:1: [sdb]  
  [  504.378588] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378590] sd 2:0:0:1: [sdb]  
  [  504.378593] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378594] sd 2:0:0:1: [sdb] CDB: 
  [  504.378604] sd 2:0:0:0: [sda]  
  [  504.378606] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378610] Read(10)
  [  504.378610] sd 2:0:0:1: [sdb] CDB: 
  [  504.378615] :
  [  504.378622] Read(10): 28 00
  [  504.378622]  28 00
  [  504.378629]  00
  [  504.378637]  00 00 00
  [  504.378643]  00
  [  504.378648]  00 01 00 00
  [  504.378650] end_request: I/O error, dev sdb, sector 0
  [  504.378654]  31 03 fe 00 00 02 00
  [  504.378656] end_request: I/O error, dev sdb, sector 3212286
  [  504.378663] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378664] sd 2:0:0:1: [sdb]  
  [  504.378665] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378666] sd 2:0:0:1: [sdb] CDB: 
  [  504.378669] Read(10): 28 00 00 78 c9 50 00 00 80 00
  [  504.378671] end_request: I/O error, dev sdb, sector 7915856
  [  504.378675] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378676] sd 2:0:0:1: [sdb]  
  [  504.378676] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378677] sd 2:0:0:1: [sdb] CDB: 
  [  504.378680] Read(10): 28 00 00 20 b7 f0 00 00 40 00
  [  504.378681] end_request: I/O error, dev sdb, sector 2144240
  [  504.378685] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378686] sd 2:0:0:1: [sdb]  
  [  504.378687] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378688] sd 2:0:0:1: [sdb] CDB: 
  [  504.378692] Read(10): 28 00 00 1a 15 86 00 00 0c 00
  [  504.378693] end_request: I/O error, dev sdb, sector 1709446
  [  504.378698] sd 2:0:0:1: [sdb] Unhandled error code
  [  504.378699] sd 2:0:0:1: [sdb]  
  [  504.378699] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.378701] sd 2:0:0:1: [sdb] CDB: 
  [  504.378707] Read(10): 28 00 00 90 00 58 00 00 05 00
  [  504.378707] end_request: I/O error, dev sdb, sector 9437272
  [  504.378711] device-mapper: multipath: Failing path 8:16.
  [  504.380205] sd 2:0:0:1: [sdb] Synchronizing SCSI cache
  [  504.380228] sd 2:0:0:1: [sdb]  
  [  504.380230] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.381380] sd 2:0:1:0: [sdc] Synchronizing SCSI cache
  [  504.381398] sd 2:0:1:0: [sdc]  
  [  504.381399] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.382322] sd 2:0:1:1: [sdd] Synchronizing SCSI cache
  [  504.382338] sd 2:0:1:1: [sdd]  
  [  504.382339] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.382617] device-mapper: multipath: Failing path 8:80.
  [  504.383626] sd 2:0:2:0: [sde] Synchronizing SCSI cache
  [  504.383643] sd 2:0:2:0: [sde]  
  [  504.383644] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.384548] sd 2:0:2:1: [sdf] Synchronizing SCSI cache
  [  504.384564] sd 2:0:2:1: [sdf]  
  [  504.384566] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
  [  504.398596] device-mapper: multipath: Failing path 8:0.
  [  504.426588] device-mapper: multipath: Could not failover the device: Handler scsi_dh_rdac Error 15.
  [  504.426661] device-mapper: multipath: Failing path 8:64.
  [  505.205770] qla2xxx [0001:07:00.1]-2064:5: SNS scan failed -- assuming zero-entry result.
  [  516.306267] device-mapper: multipath: Could not failover the device: Handler scsi_dh_rdac Error 15.
  [  516.306421] device-mapper: multipath: Failing path 8:48.
  [  516.306435] device-mapper: multipath: Could not failover the device: Handler scsi_dh_rdac Error 15.
  [  516.306522] device-mapper: multipath: Failing path 8:16.
  [  516.307056] device-mapper: multipath: Could not failover the device: Handler scsi_dh_rdac Error 15.
  [  516.307148] device-mapper: multipath: Failing path 8:0.
  [  516.307157] device-mapper: multipath: Could not failover the device: Handler scsi_dh_rdac Error 15.
  [  516.307244] device-mapper: multipath: Failing path 8:32.

  Please find the host kernel log details in the attachment.
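
One quick way to read a log like the one above is to list the distinct paths that multipath marked as failed. The sketch below assumes the kernel log text arrives on stdin; `failing_paths` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch: extract the major:minor of every "Failing path" message
# from kernel log text on stdin and print each one once.
failing_paths() {
    sed -n 's/.*multipath: Failing path \([0-9]*:[0-9]*\)\..*/\1/p' | sort -u
}
```

Fed the log above from a saved file, this should report six distinct paths (8:0, 8:16, 8:32, 8:48, 8:64 and 8:80), which matches 2 LUNs over 3 paths each.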
   

  I don't see anything suspicious on the EEH core side. It seems that
  the device driver failed to resume the device as expected. Could you
  please ask the device driver developer to take a look at this? Maybe the
  device driver is missing some fixes. Also, it would be worth running a
  test on upstream Linux as well to see whether that works.

  Thanks,
  Gavin

  I performed the EEH test on Ubuntu 14.04.2 (3.16.0-29 Linux kernel)
  and am facing the same issue there as well.

  As of now, this has been tested on Ubuntu 14.04.1, 14.04.2 and 14.10.
  This issue appears in Ubuntu 14.04.2, Ubuntu 14.10 and 15.04.

  Rajesh, could you reload the driver with the parameter
  ql2xextended_error_logging set to 0x00200000 and rerun the EEH test?

  This should enable additional logs that might be useful in order to
  debug this issue. Could you attach the full dmesg output of this test
  to the bug?
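
The reload requested above could look roughly like the following. This is a sketch under assumptions: the parameter name and value come from the request, but the sysfs path in `PARAM_FILE` is assumed to be exposed by the driver on this kernel, and `reload_cmds` and `show_mask` are illustrative names. Unloading qla2xxx takes down every path behind the adapter, so I/O should be stopped first.

```shell
#!/bin/sh
# Sketch: reload qla2xxx with extended error logging enabled.
LOG_MASK=0x00200000
# Assumption: the module parameter is readable here once the driver is loaded.
PARAM_FILE=/sys/module/qla2xxx/parameters/ql2xextended_error_logging

reload_cmds() {
    # Print the commands rather than running them, so they can be reviewed.
    echo "modprobe -r qla2xxx"
    echo "modprobe qla2xxx ql2xextended_error_logging=$LOG_MASK"
}

show_mask() {
    # After the reload, print the active value if the file is readable.
    [ -r "$PARAM_FILE" ] && cat "$PARAM_FILE"
}
```

After running the printed commands as root and repeating the EEH test, the full output of `dmesg` can be saved to a file and attached to the bug.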

  Bear with me, since this is the first time I have worked with this adapter/driver, but this particular section of the log looks a bit worrying to me. I will take a look at the driver to see if I can figure out how it works:
  [  502.258629] qla2xxx 0001:07:00.1: Direct firmware load failed with error -2
  [  502.258630] qla2xxx 0001:07:00.1: Falling back to user helper
  [  502.259236] qla2xxx [0001:07:00.1]-0063:5: Failed to load firmware image (ql2500_fw.bin).
  [  502.259354] qla2xxx [0001:07:00.1]-0090:5: Fimware image unavailable.

  I'll also check whether we are missing any patches in the Ubuntu
  release that might solve this issue.
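
The "error -2" in the firmware messages quoted above corresponds to -ENOENT, meaning the requested file was not found. A quick check for the file might look like the sketch below. It assumes /lib/firmware as the search directory, which is the usual location but not the only one the kernel consults; `fw_present` and `fw_status` are hypothetical helper names. Whether the missing file has anything to do with the failed recovery is not established here.

```shell
#!/bin/sh
# Sketch: report whether a firmware file exists in a given directory.
fw_present() {
    # $1: firmware file name, $2: optional directory (default /lib/firmware)
    [ -e "${2:-/lib/firmware}/$1" ]
}

fw_status() {
    if fw_present "$@"; then
        echo "present"
    else
        echo "missing"
    fi
}
```

For the adapter in this report, `fw_status ql2500_fw.bin` would show whether the file named in the log is installed.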

  The log mentions that slot reset returned 5. According to pci.h,
  this value is PCI_ERS_RESULT_RECOVERED, which means the device
  driver reports itself as fully recovered and operational.
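
For reference when reading such return codes, the numeric values of the kernel's pci_ers_result enum in include/linux/pci.h can be mapped back to their names. The helper below is only a lookup sketch; `ers_name` is a hypothetical name, and the table should be checked against the pci.h of the kernel actually in use.

```shell
#!/bin/sh
# Sketch: translate a numeric pci_ers_result value into its enum name.
ers_name() {
    case "$1" in
        1) echo PCI_ERS_RESULT_NONE ;;
        2) echo PCI_ERS_RESULT_CAN_RECOVER ;;
        3) echo PCI_ERS_RESULT_NEED_RESET ;;
        4) echo PCI_ERS_RESULT_DISCONNECT ;;
        5) echo PCI_ERS_RESULT_RECOVERED ;;
        6) echo PCI_ERS_RESULT_NO_AER_DRIVER ;;
        *) echo "unknown($1)" ;;
    esac
}
```

With this table, `ers_name 5` prints PCI_ERS_RESULT_RECOVERED, the value discussed above.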

  I am looking at the driver's code to find out why the driver is
  failing to resume the adapter.
