kernel-packages team mailing list archive

[Bug 1587295] Re: drmgr failed to remove i/o slot

 

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu)
       Status: New => Triaged

** Changed in: linux (Ubuntu)
     Assignee: Taco Screen team (taco-screen-team) => Canonical Kernel Team (canonical-kernel-team)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1587295

Title:
  drmgr failed to remove i/o slot

Status in linux package in Ubuntu:
  Triaged

Bug description:
  == Comment: #0 - Minh Nguyen <minhn@xxxxxxxxxx> - 2015-12-04 10:01:38 ==
  ---Problem Description---
  While performing drmgr to remove an IO slot, we encounter a failure:
  >pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1
  [PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r returned 255. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r
  Validating PHB DLPAR capability...yes.
  Isolation failed for 20000029 with -9001
  Valid outstanding translations exist.

  /var/log/syslog showed:

  Dec  3 15:07:22 yc00sp-neo kernel: [  395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
  Dec  3 15:07:22 yc00sp-neo kernel: [  395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
  Dec  3 15:07:23 yc00sp-neo kernel: [  396.625406] iommu: Removing device 0001:01:00.0 from group 1
  Dec  3 15:07:24 yc00sp-neo kernel: [  397.293386] iommu: Removing device 0001:01:00.1 from group 1
  Dec  3 15:07:34 yc00sp-neo kernel: [  407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
  Dec  3 15:07:34 yc00sp-neo kernel: [  407.298844] rpadlpar_io: slot PHB 41 removed

  /var/log/drmgr showed:

  retrieving hotplug nodes
  Could not find DRC property group in path: /proc/device-tree/pci@80000002000001b.
  hp adapter status for U78C9.001.WZS005Z-P1-C3 is 1
  setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS005Z-P1-C3
  hp adapter status for U78C9.001.WZS005Z-P1-C3 is 2
  Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0,1
  Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0
  HPDEV: /sys/bus/pci/devices/0000:50:00.0
         /pci@80000002000001b/usb@0
  performing kernel op for PHB 41, file is /sys/bus/pci/slots/control/remove_slot
  Removing device-tree node /proc/device-tree/pci@800000020000029
  Removing device-tree node /proc/device-tree/interrupt-controller@800000025000029
  Releasing drc index 0x20000029
  get-sensor for 20000029: 0, 1
  Setting isolation state to 'isolate'
  Isolation failed for 20000029 with -9001
  Valid outstanding translations exist.

  The slot has a 10 Gigabit Ethernet-SFP+ SR PCI-E adapter.
   
  Contact Information = Minh Nguyen (minhn@xxxxxxxxxx) Jeremy Arnold (arnoldje@xxxxxxxxxx)  
   
  ---uname output---
  Linux yc00sp-neo 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 14:49:47 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = 8286-42A 
   
  ---Debugger---
  A debugger is not configured
   
  ---Steps to Reproduce---
   Run the command:

  pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1
   
  Userspace tool common name: gdb 
   
  The userspace tool has the following bit modes: 64bit 

  Userspace rpm: powerpc-ibm-utils

  Userspace tool obtained from project website:  na 
   
  *Additional Instructions for Minh Nguyen (minhn@xxxxxxxxxx) Jeremy Arnold (arnoldje@xxxxxxxxxx) : 
  -Post a private note with access information to the machine that the bug is occurring on.
  -Attach ltrace and strace of userspace application.

  == Comment: #7 - Carol L. Soto <clsoto@xxxxxxxxxx> - 2016-02-08 16:15:57 ==
  I looked through /var/log/kern.log.4 (a copy is in /tmp/kern.log.4) and I see this:
  Dec  3 15:00:51 yc00sp-neo kernel: [    4.762738] ibmvmc: sethmcid: Set HMC ID: "neo 1"
  Dec  3 15:00:51 yc00sp-neo kernel: [    4.817873] DCCP: Activated CCID 2 (TCP-like)
  Dec  3 15:07:22 yc00sp-neo kernel: [  395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
  Dec  3 15:07:22 yc00sp-neo kernel: [  395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
  Dec  3 15:07:23 yc00sp-neo kernel: [  396.625406] iommu: Removing device 0001:01:00.0 from group 1
  Dec  3 15:07:24 yc00sp-neo kernel: [  397.293386] iommu: Removing device 0001:01:00.1 from group 1
  Dec  3 15:07:34 yc00sp-neo kernel: [  407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
  Dec  3 15:07:34 yc00sp-neo kernel: [  407.298844] rpadlpar_io: slot PHB 41 removed

  but I do not see any Mellanox traces; I only see be2net traces, which
  belong to a different device.

  == Comment: #15 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-18 13:49:26 ==
  Looking around the system, I notice that 'lspci' shows no (ethernet) device. I looked at the kernel and the module 'be2net' was still loaded, but had zero dependents. I ran "rmmod be2net" and the module was removed without error. I then ran the pvmctl remove command and it appeared to succeed:

  root@cs-tul6-neo:~# pvmctl IOSlot detach --drc-names U78CB.001.WZS00D0-P1-C6 -p id=1
  [PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r returned 3. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r
  Validating PHB DLPAR capability...yes.
  root@cs-tul6-neo:~#

  and pvmctl io list does not show the device any more.
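
The "loaded but with zero dependents" condition described above can also be checked programmatically. The following is a minimal sketch and not something from the original report: it assumes the usual /proc/modules layout (name, size, reference count, comma-separated users, then state and address), and the helper names parse_proc_modules and is_loaded_but_unused are made up for illustration.

```python
def parse_proc_modules(text):
    """Parse /proc/modules-style text into {name: (refcount, [users])}.

    Assumed field order per line: name size refcount users state address.
    refcount is None when the kernel reports "-" (module unloading disabled).
    """
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, _size, refcount, users = fields[:4]
        count = int(refcount) if refcount.isdigit() else None
        # "-" means no users; otherwise a comma-separated list with a trailing comma
        user_list = [] if users == "-" else [u for u in users.split(",") if u]
        modules[name] = (count, user_list)
    return modules


def is_loaded_but_unused(text, module):
    """True if `module` is loaded, has reference count 0 and no listed users."""
    entry = parse_proc_modules(text).get(module)
    return entry is not None and entry[0] == 0 and not entry[1]


if __name__ == "__main__":
    with open("/proc/modules") as f:
        print(is_loaded_but_unused(f.read(), "be2net"))
```

In the state described in this comment (be2net still loaded, nothing depending on it), the check would be expected to report True, which is the situation in which "rmmod be2net" can succeed.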

  == Comment: #27 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-25 12:30:26 ==
  With a point in the right direction from Alexey, I think I've found the problem. The adapter->pcicfg is either derived from the existing map of adapter->db or mapped anew depending on circumstances. However, no record is kept of which was done, and at remove time no attempt is made to release the map. The following debug output from be2net shows the problem:

  [   81.383949] be2net 0000:01:00.0: be2net version is 10.6.0.3debug
  [   81.383953] be2net : be_probe() entered
  [   81.384531] be2net 0000:01:00.0: Using 64-bit direct DMA at offset 800000000000000
  [   81.384715] be2net 0000:01:00.0: PCIe error reporting enabled
  [   81.384779] be2net : d000080080200000 = pci_iomap(csr)
  [   81.384780] be2net : d000080080240000 = pci_iomap(db)
  [   81.384782] be2net : d0000800801e4000 = pci_iomap(pcicfg)
  [   81.562417] be2net 0000:01:00.0: adapter not in advanced mode
  [   81.714383] be2net 0000:01:00.0: FW config: function_mode=0x2003, function_caps=0xf
  [   81.778370] be2net 0000:01:00.0: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
  [   81.778373] be2net 0000:01:00.0: Max: uc-macs 30, mc-macs 64, vlans 64
  [   81.780257] be2net 0000:01:00.0: enabled 4 MSI-x vector(s) for NIC
  [   82.066316] be2net 0000:01:00.0: created 4 TX queue(s)
  [   82.146293] be2net 0000:01:00.0: created 5 RX queue(s)
  [   82.281405] be2net 0000:01:00.0: FW version is 4.4.180.7
  [   82.282109] be2net 0000:01:00.0: HW Flow control - TX:1 RX:1
  [   82.283251] be2net 0000:01:00.0: Emulex OneConnect(be3): PF  port 0
  [   82.283253] be2net : be_probe() left
  [   82.283263] be2net 0000:01:00.1: be2net version is 10.6.0.3debug
  [   82.283264] be2net : be_probe() entered
  [   82.283769] be2net 0000:01:00.1: Using 64-bit direct DMA at offset 800000000000000
  [   82.283952] be2net 0000:01:00.1: PCIe error reporting enabled
  [   82.284743] be2net : d0000800802c0000 = pci_iomap(csr)
  [   82.284745] be2net : d000080080300000 = pci_iomap(db)
  [   82.284747] be2net : d0000800802a0000 = pci_iomap(pcicfg)
  [   82.286982] be2net 0000:01:00.0 enp1s0f0: renamed from eth2
  [   82.462224] be2net 0000:01:00.1: adapter not in advanced mode
  [   82.614194] be2net 0000:01:00.1: FW config: function_mode=0x2003, function_caps=0xf
  [   82.678188] be2net 0000:01:00.1: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
  [   82.678191] be2net 0000:01:00.1: Max: uc-macs 30, mc-macs 64, vlans 64
  [   82.680083] be2net 0000:01:00.1: enabled 4 MSI-x vector(s) for NIC
  [   82.962129] be2net 0000:01:00.1: created 4 TX queue(s)
  [   83.042104] be2net 0000:01:00.1: created 5 RX queue(s)
  [   83.121652] be2net 0000:01:00.1: FW version is 4.4.180.7
  [   83.122356] be2net 0000:01:00.1: HW Flow control - TX:1 RX:1
  [   83.123492] be2net 0000:01:00.1: Emulex OneConnect(be3): PF  port 1
  [   83.123493] be2net : be_probe() left
  [   83.125255] be2net 0000:01:00.1 enp1s0f1: renamed from eth2
  [  165.196825] be2net : be_remove() entered
  [  165.585166] be2net : pci_iounmap(d000080080200000)
  [  165.585172] be2net : pci_iounmap(d000080080240000)
  [  165.585423] be2net : be_remove() left
  [  165.585638] be2net : be_remove() entered
  [  165.981157] be2net : pci_iounmap(d0000800802c0000)
  [  165.981163] be2net : pci_iounmap(d000080080300000)
  [  165.981415] be2net : be_remove() left

  Since the fix is more than simply adding an (unconditional) call to
  pci_iounmap(), we probably need to get Emulex involved to see how they
  want to fix this.

  As an experiment, I added code to track the condition and do the
  unmap. However, the remove still fails with the same error message,
  even though the pcicfg mapping is now removed. So, there may still be
  other resources - or else this was not the cause of the error.
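
The missing release can be seen mechanically by pairing the pci_iomap() and pci_iounmap() lines in the be2net debug output above. The sketch below is illustrative only: the log lines are copied from this comment, while the regular expressions and the function name find_unreleased_mappings are assumptions made for this example and are not part of be2net or any kernel tool.

```python
import re

# pci_iomap/pci_iounmap debug lines copied from the be2net output in this comment.
DEBUG_LOG = """\
[   81.384779] be2net : d000080080200000 = pci_iomap(csr)
[   81.384780] be2net : d000080080240000 = pci_iomap(db)
[   81.384782] be2net : d0000800801e4000 = pci_iomap(pcicfg)
[   82.284743] be2net : d0000800802c0000 = pci_iomap(csr)
[   82.284745] be2net : d000080080300000 = pci_iomap(db)
[   82.284747] be2net : d0000800802a0000 = pci_iomap(pcicfg)
[  165.585166] be2net : pci_iounmap(d000080080200000)
[  165.585172] be2net : pci_iounmap(d000080080240000)
[  165.981157] be2net : pci_iounmap(d0000800802c0000)
[  165.981163] be2net : pci_iounmap(d000080080300000)
"""

# Patterns match the debug format shown above (an assumption about this debug build).
MAP_RE = re.compile(r"be2net : ([0-9a-f]+) = pci_iomap\((\w+)\)")
UNMAP_RE = re.compile(r"be2net : pci_iounmap\(([0-9a-f]+)\)")


def find_unreleased_mappings(log_text):
    """Return {address: region} for each pci_iomap() with no later pci_iounmap()."""
    live = {}
    for line in log_text.splitlines():
        mapped = MAP_RE.search(line)
        if mapped:
            live[mapped.group(1)] = mapped.group(2)
            continue
        unmapped = UNMAP_RE.search(line)
        if unmapped:
            # Drop the mapping if we saw it; ignore unmaps of unknown addresses.
            live.pop(unmapped.group(1), None)
    return live


if __name__ == "__main__":
    for addr, region in find_unreleased_mappings(DEBUG_LOG).items():
        print(f"{addr} ({region}) was mapped but never unmapped")
```

Run against the lines above, this reports the two pcicfg mappings (d0000800801e4000 and d0000800802a0000) as never unmapped, one per PCI function, which matches the observation that be_remove() only releases csr and db.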

  == Comment: #28 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-25 12:44:35 ==
  Jesse ran the f/w debug again, got this:

  Failed with the same return code: looks like two page table entries in there for 21010018
                                                                  H S                                                    
                                                                  V V C R T G B S L H W I M G N E  UT P        PS   SS  K
                                                                  a a h e a r l p p             n  pi p        ai   ei  e
                            Vpn                  RealAddr         l l g f g p t V g                dm          gz   gz  y
  ==RA=0003FF8200000000==================================================================================================
  HPTE     80000020FEDA4700 0013D349C0080120 Phy 8003FF8200100000 X   X X         X     X X X X   000 NAU     64K   1T 00
  HPTE     80000020FEDA4D00 0013D349C0080060 Phy 8003FF8200100000 X   X X         X     X X X X   000 NAU     64K   1T 00
  =======================================================================================================================
  The entries marked in bold in the original output (the formatting is lost here) are the virtual page numbers that are still registered.

  So, what I found does not appear to have been the HPTEs that are
  causing the problem - even though it does appear to be a bug in
  be2net. Back to hunting down these addresses.

  == Comment: #37 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-03-08 07:56:14 ==
  The fix is now in kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2

  == Comment: #39 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-03-30 15:00:56 ==
  I'm not sure what the correct state is. I think I saw notes on another bugzilla asking Canonical to update 15.10, so I wonder what this bug is for. Should it be changed to FIXED awaiting a new kernel from Canonical?

  == Comment: #42 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-05-26 16:27:49 ==
  This needs to be mirrored to Canonical so they can pull the commit from kernel.org.

  == Comment: #43 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-05-26 16:29:10 ==
   kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2 needs to be pulled into Ubuntu 16.04.1

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1587295/+subscriptions