kernel-packages team mailing list archive

Thread
Date
[Bug 1442186] proposed fix #2

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: bugproxy <bugproxy@xxxxxxxxxx>
Date: Mon, 11 May 2015 19:50:00 -0000
Reply-to: Bug 1442186 <1442186@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Default Comment by Bridge

** Attachment added: "proposed fix #2"
   https://bugs.launchpad.net/bugs/1442186/+attachment/4395789/+files/nvme_power.patch

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1442186

Title:
  NVMe device driver failing w/nvme-user and htx tools

Status in linux package in Ubuntu:
  Fix Released

Bug description:
  ---Problem Description---
  NVMe device driver running with a Samsung SM1715 card fails (hits EEH) w/nvme-user and htx tools
   
  Machine Type = Tuleta P8 
   
  ---Steps to Reproduce---
   1. Install the nvme-user tools package and run the nvme_rw tool
  or
  2. Install the htx I/O exerciser tool and run htx
   
  ---uname output---
  Linux ubuntu1504 3.18.0-12-generic #13-Ubuntu SMP Thu Jan 29 13:44:26 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  EEH error in kernel log  after running HTX:

  [503195.365335] EEH: Frozen PE#2 on PHB#4 detected
  [503195.365346] EEH: PE location: U78CB.001.D123456-P1-C5      , PHB location: U
  78CB.001.D123456-P1-C33
  [503195.366226] EEH: This PCI device has failed 1 times in the last hour
  [503195.366227] EEH: Notify device drivers to shutdown
  [503195.366233] EEH: Collect temporary log
  [503195.366315] EEH: of node=/pciex@3fffe42000000/pci@0/mass-storage@0
  [503195.366321] EEH: PCI device/vendor: a820144d
  [503195.366327] EEH: PCI cmd/status register: 00100546
  [503195.366329] EEH: PCI-E capabilities and status follow:
  [503195.366356] EEH: PCI-E 00: 00024010 10008fe1 0000202e 00416c43 
  [503195.366372] EEH: PCI-E 10: 10430040 00000000 00000000 00000000 
  [503195.366374] EEH: PCI-E 20: 00000000 
  [503195.366375] EEH: PCI-E AER capability register set follows:
  [503195.366396] EEH: PCI-E AER 00: 18020001 00000000 00400000 00440010
  [503195.366413] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000 
  [503195.366429] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
  [503195.366435] EEH: PCI-E AER 30: 00000000 00000000 
  [503195.366438] PHB3 PHB#4 Diag-data (Version: 1)
  [503195.366440] brdgCtl:     00000002
  [503195.366442] UtlSts:      00200000 00000000 00000000
  [503195.366444] RootSts:     00000020 00400000 f0430048 00100147 00000000
  [503195.366446] PhbSts:      0000001c00000000 0000001c00000000
  [503195.366448] Lem:         0000000004000000 42498e327f502eae 0000000000000000
  [503195.366450] InAErr:      0000000000004000 0000000000004000 0000000000000000 0000000000000000
  [503195.366452] PE[  2] A/B: 8000302500000000 8000000000000000
  [503195.366454] EEH: Reset with hotplug activity
  [503195.367741] nvme 0004:01:00.0: Cancelling I/O 0 QID 57
  [503199.411764] EEH: Sleep 5s ahead of complete hotplug
  [503204.415815] pci 0004:01:00.0: [144d:a820] type 00 class 0x010802
  [503204.415874] pci 0004:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
  [503204.423731] pci 0004:01:00.0: BAR 0: assigned [mem 0x3ff000000000-0x3ff000003fff 64bit]
  [503204.423970] nvme 0004:01:00.0: enabling device (0140 -> 0142)
  [503204.423998] nvme 0004:01:00.0: Using 64-bit DMA iommu bypass
  [503207.914466]  nvme1n1: unknown partition table
  [503207.914643] EEH: Notify device driver to resume

  
  The PHB3 diag-data is telling: DMA write to PCI bus address 0x0, which doesn't have corresponding valid TCE entries. As I know, there are 2 possibilities and (A) might be the case.

  (A) As MSI/MSIx interrupt is essentially DMA write transaction, the
  PCI bus address 0x0 would be MSI/MSIx message address. The address 0x0
  is obviously invalid. It indicates there is spurious MSI/MSIx
  interrupt whose MSI/MSIx message wasn't populated and updated to
  hardware yet.

  (B) We really have 0x0 DMA write transaction and the TCE entries
  wasn't populated and built yet. From the kernel log, we're using
  direct DMA window. So the DMA address shouldn't less than (0x1ul <<
  59).

  Recap the PHB3 diag-data
  ================

  [503195.366438] PHB3 PHB#4 Diag-data (Version: 1)
  [503195.366440] brdgCtl:     00000002
  [503195.366442] UtlSts:      00200000 00000000 00000000
  [503195.366444] RootSts:     00000020 00400000 f0430048 00100147 00000000
  [503195.366446] PhbSts:      0000001c00000000 0000001c00000000
  [503195.366448] Lem:         0000000004000000 42498e327f502eae 0000000000000000
  [503195.366450] InAErr:      0000000000004000 0000000000004000 0000000000000000 0000000000000000
  [503195.366452] PE[  2] A/B: 8000302500000000 8000000000000000

  William helped run following command, which triggers EEH error, which
  has same root cause as we saw before. After that, re-running the
  command won't trigger EEH again:

  # sudo dd if=/dev/nvme1n1 of=/dev/null bs=1k count=512
  # dmesg | grep EEH.*has\ failed
  [ 2377.124104] EEH: This PCI device has failed 1 times in the last hour

  I'm not sure if there is possibility of content in MSIx table is
  overwritten because the device has only one memory BAR. The top 4KB
  region of the BAR is reserved for MSIx table and left space for MMIO
  registers. So we probably need dump the MSIx table to have a check
  when detecting EEH error:

  # sudo lspci -vv -s 0004:01:00.0
  0004:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd \
  NVMe SSD Controller 171X (rev 03) (prog-if 02 [NVM Express])
  	Subsystem: Samsung Electronics Co Ltd Device a809
  	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
  	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
  	Latency: 0
  	Interrupt: pin A routed to IRQ 512
  	Region 0: Memory at 3ff000000000 (64-bit, non-prefetchable) [size=16K]
          :
          :
  	Capabilities: [e0] MSI-X: Enable+ Count=129 Masked-
  		Vector table: BAR=0 offset=00002000
  		PBA: BAR=0 offset=00003000

  > I'm not sure if there is possibility of content in MSIx table is overwritten
  > because the device has only one memory BAR. The top 4KB region of the BAR is
  > reserved for MSIx table and left space for MMIO registers. So we probably
  > need dump the MSIx table to have a check when detecting EEH error:
  > 
  > 	Region 0: Memory at 3ff000000000 (64-bit, non-prefetchable) [size=16K]
  >         :
  >         :
  > 	Capabilities: [e0] MSI-X: Enable+ Count=129 Masked-
  > 		Vector table: BAR=0 offset=00002000
  > 		PBA: BAR=0 offset=00003000

  Actually this adapter has little different BAR mapping.

  BAR0 + 0x0000 - 0x1FFF - 8 KB reserved for controller/device management
  BAR0 + 0x2000 - 0x2FFF - 4K meant for MSI-X table, this device supports 129 vectors so, 0x2000 - 0x280F (2046 bytes) used
  BAR0 + 0x3000 - 0x3FFF - 4K meant for Pending Bit Array (PBA)

  I see 128 MSI-X are used/enabled since this system has 128 CPU threads and NVMe driver allocates one MSI-X for each online CPU/completion Q.
  Admin Q and first I/O Q share the same interrupt.

  http://www.nvmexpress.org/wp-content/uploads/NVM-Express-
  1_2-Gold-20141209.pdf

  I see MSI-X 0 uses 0x1000000000000000 address with MSg. pay load data
  value of 0x0000

  0x3FF000002000  0x1000000000000000 0x0000000000000000 -- MSI-X 0
  0x3FF000002010  0x1000000000000000 0x0000000000000001
  ....
  0x3FF0000021F0  0x1000000000000000 0x000000000000001F
  0x3FF000002200  0x1000000000000200 0x0000000000000000
  0x3FF000002210  0x1000000000000200 0x0000000000000001
  ...
  0x3FF0000023F0  0x1000000000000200 0x000000000000001F
  0x3FF000002400  0x1000000000000400 0x0000000000000000
  0x3FF000002410  0x1000000000000400 0x0000000000000001
  ...
  0x3FF000002600  0x1000000000000600 0x0000000000000000
  0x3FF000002610  0x1000000000000600 0x0000000000000001
  0x3FF000002620  0x1000000000000600 0x0000000000000002
  ...
  0x3FF0000027F0  0x1000000000000600 0x000000000000001F  ---> MSI-X 127
  0x3FF000002800  0x0000000000000000 0x0000000100000000 ---> MSI-X 128 (129th) and not enabled by the driver

  Do you see any implementation issue w/ supporting 129 MSI-X and having
  MSI-X table at BAR0+offset 0x2000?

  Also MSI-X table seems to be intact after EEH recovery.

  The MSIx table looks good except MSIx 128 (129th), which shouldn't have 0x0 as its message address. Once this address is sent to root port and it would trigger the EEH error caused by failing DMA address 0x0.
  Could you please claim when you dumped the MSIx table? If you dumped the table before any data transfer to the device, I would believe the MSIx interrupt#128 was never populated properly.

  I installed one kernel built from 4.0.rc3, which has 4KB system memory
  page size enabled, on William's machine. There're no problems found
  after 5 hours HTX testing. As discussed with Murali: IBM Power is
  using 64KB system page size by default, which brought troubles for
  NVME driver to calculate DMA address written to the device properly.
  More details can be found in drivers/block/nvme-
  core.c::nvme_setup_prps(). I believe Murali is working/testing on fix
  against it and hopefully it will be fixed pretty soon.

  The attached patch will work for 3.19.x kernel or later but won't
  apply on 3.18.x based kernel. Please update your system to latest
  Ubuntu 15.04 that uses 3.19.x and then apply the patch on that kernel.
  nvme_setup_prps() in 3.18.x is different and needs to change quite a
  bit. Let's see whether it is worth to back port or not.

  This patch works as expected. HTX is running over an hour w/o any mis-
  compare. Typically fails right away w/o this fix..

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1442186/+subscriptions