kernel-packages team mailing list archive

[Bug 1587295] [NEW] drmgr failed to remove i/o slot

 

You have been subscribed to a public bug:

== Comment: #0 - Minh Nguyen <minhn@xxxxxxxxxx> - 2015-12-04 10:01:38 ==
---Problem Description---
While performing drmgr to remove an IO slot, we encounter a failure:
>pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1
[PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r returned 255. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r
Validating PHB DLPAR capability...yes.
Isolation failed for 20000029 with -9001
Valid outstanding translations exist.

/var/log/syslog showed:

Dec  3 15:07:22 yc00sp-neo kernel: [  395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
Dec  3 15:07:22 yc00sp-neo kernel: [  395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
Dec  3 15:07:23 yc00sp-neo kernel: [  396.625406] iommu: Removing device 0001:01:00.0 from group 1
Dec  3 15:07:24 yc00sp-neo kernel: [  397.293386] iommu: Removing device 0001:01:00.1 from group 1
Dec  3 15:07:34 yc00sp-neo kernel: [  407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
Dec  3 15:07:34 yc00sp-neo kernel: [  407.298844] rpadlpar_io: slot PHB 41 removed

/var/log/drmgr showed:

retrieving hotplug nodes
Could not find DRC property group in path: /proc/device-tree/pci@80000002000001b.
hp adapter status for U78C9.001.WZS005Z-P1-C3 is 1
setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS005Z-P1-C3
hp adapter status for U78C9.001.WZS005Z-P1-C3 is 2
Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0,1
Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0
HPDEV: /sys/bus/pci/devices/0000:50:00.0
       /pci@80000002000001b/usb@0
performing kernel op for PHB 41, file is /sys/bus/pci/slots/control/remove_slot
Removing device-tree node /proc/device-tree/pci@800000020000029
Removing device-tree node /proc/device-tree/interrupt-controller@800000025000029
Releasing drc index 0x20000029
get-sensor for 20000029: 0, 1
Setting isolation state to 'isolate'
Isolation failed for 20000029 with -9001
Valid outstanding translations exist.

The slot has a 10 Gigabit Ethernet-SFP+ SR PCI-E adapter
 
Contact Information = Minh Nguyen (minhn@xxxxxxxxxx) Jeremy Arnold (arnoldje@xxxxxxxxxx)  
 
---uname output---
Linux yc00sp-neo 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 14:49:47 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = 8286-42A 
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
 Run the command:

pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1
 
Userspace tool common name: gdb 
 
The userspace tool has the following bit modes: 64bit 

Userspace rpm: powerpc-ibm-utils

Userspace tool obtained from project website:  na 
 
*Additional Instructions for Minh Nguyen (minhn@xxxxxxxxxx) Jeremy Arnold (arnoldje@xxxxxxxxxx) : 
-Post a private note with access information to the machine that the bug is occurring on.
-Attach ltrace and strace of userspace application.

== Comment: #7 - Carol L. Soto <clsoto@xxxxxxxxxx> - 2016-02-08 16:15:57 ==
I looked through /var/log/kern.log.4 (I put a copy in /tmp/kern.log.4)
and I see this:
Dec  3 15:00:51 yc00sp-neo kernel: [    4.762738] ibmvmc: sethmcid: Set HMC ID: "neo 1"
Dec  3 15:00:51 yc00sp-neo kernel: [    4.817873] DCCP: Activated CCID 2 (TCP-like)
Dec  3 15:07:22 yc00sp-neo kernel: [  395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
Dec  3 15:07:22 yc00sp-neo kernel: [  395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
Dec  3 15:07:23 yc00sp-neo kernel: [  396.625406] iommu: Removing device 0001:01:00.0 from group 1
Dec  3 15:07:24 yc00sp-neo kernel: [  397.293386] iommu: Removing device 0001:01:00.1 from group 1
Dec  3 15:07:34 yc00sp-neo kernel: [  407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
Dec  3 15:07:34 yc00sp-neo kernel: [  407.298844] rpadlpar_io: slot PHB 41 removed
~


but I do not see any Mellanox traces; I only see be2net traces. That is
a different device.

== Comment: #15 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-18 13:49:26 ==
Looking around the system, I notice that 'lspci' shows no (ethernet) device. I looked at the kernel and the module 'be2net' was still loaded, but had zero dependents. I ran "rmmod be2net" and the module was removed without error. I then ran the pvmctl remove command and it appeared to succeed:

root@cs-tul6-neo:~# pvmctl IOSlot detach --drc-names U78CB.001.WZS00D0-P1-C6 -p id=1
[PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r returned 3. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r
Validating PHB DLPAR capability...yes.
root@cs-tul6-neo:~#

and pvmctl io list does not show the device any more.

== Comment: #27 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-25 12:30:26 ==
With a point in the right direction from Alexey, I think I've found the problem. The adapter->pcicfg is either derived from the existing map of adapter->db or mapped anew depending on circumstances. However, no record is kept of which was done, and at remove time no attempt is made to release the map. The following debug output from be2net shows the problem:

[   81.383949] be2net 0000:01:00.0: be2net version is 10.6.0.3debug
[   81.383953] be2net : be_probe() entered
[   81.384531] be2net 0000:01:00.0: Using 64-bit direct DMA at offset 800000000000000
[   81.384715] be2net 0000:01:00.0: PCIe error reporting enabled
[   81.384779] be2net : d000080080200000 = pci_iomap(csr)
[   81.384780] be2net : d000080080240000 = pci_iomap(db)
[   81.384782] be2net : d0000800801e4000 = pci_iomap(pcicfg)
[   81.562417] be2net 0000:01:00.0: adapter not in advanced mode
[   81.714383] be2net 0000:01:00.0: FW config: function_mode=0x2003, function_caps=0xf
[   81.778370] be2net 0000:01:00.0: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
[   81.778373] be2net 0000:01:00.0: Max: uc-macs 30, mc-macs 64, vlans 64
[   81.780257] be2net 0000:01:00.0: enabled 4 MSI-x vector(s) for NIC
[   82.066316] be2net 0000:01:00.0: created 4 TX queue(s)
[   82.146293] be2net 0000:01:00.0: created 5 RX queue(s)
[   82.281405] be2net 0000:01:00.0: FW version is 4.4.180.7
[   82.282109] be2net 0000:01:00.0: HW Flow control - TX:1 RX:1
[   82.283251] be2net 0000:01:00.0: Emulex OneConnect(be3): PF  port 0
[   82.283253] be2net : be_probe() left
[   82.283263] be2net 0000:01:00.1: be2net version is 10.6.0.3debug
[   82.283264] be2net : be_probe() entered
[   82.283769] be2net 0000:01:00.1: Using 64-bit direct DMA at offset 800000000000000
[   82.283952] be2net 0000:01:00.1: PCIe error reporting enabled
[   82.284743] be2net : d0000800802c0000 = pci_iomap(csr)
[   82.284745] be2net : d000080080300000 = pci_iomap(db)
[   82.284747] be2net : d0000800802a0000 = pci_iomap(pcicfg)
[   82.286982] be2net 0000:01:00.0 enp1s0f0: renamed from eth2
[   82.462224] be2net 0000:01:00.1: adapter not in advanced mode
[   82.614194] be2net 0000:01:00.1: FW config: function_mode=0x2003, function_caps=0xf
[   82.678188] be2net 0000:01:00.1: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
[   82.678191] be2net 0000:01:00.1: Max: uc-macs 30, mc-macs 64, vlans 64
[   82.680083] be2net 0000:01:00.1: enabled 4 MSI-x vector(s) for NIC
[   82.962129] be2net 0000:01:00.1: created 4 TX queue(s)
[   83.042104] be2net 0000:01:00.1: created 5 RX queue(s)
[   83.121652] be2net 0000:01:00.1: FW version is 4.4.180.7
[   83.122356] be2net 0000:01:00.1: HW Flow control - TX:1 RX:1
[   83.123492] be2net 0000:01:00.1: Emulex OneConnect(be3): PF  port 1
[   83.123493] be2net : be_probe() left
[   83.125255] be2net 0000:01:00.1 enp1s0f1: renamed from eth2
[  165.196825] be2net : be_remove() entered
[  165.585166] be2net : pci_iounmap(d000080080200000)
[  165.585172] be2net : pci_iounmap(d000080080240000)
[  165.585423] be2net : be_remove() left
[  165.585638] be2net : be_remove() entered
[  165.981157] be2net : pci_iounmap(d0000800802c0000)
[  165.981163] be2net : pci_iounmap(d000080080300000)
[  165.981415] be2net : be_remove() left

Since the fix is more than simply adding an (unconditional) call to
pci_iounmap(), we probably need to get Emulex involved to see how they
want to fix this.

As an experiment, I added code to track the condition and do the unmap.
However, the remove still fails with the same error message, even though
the pcicfg mapping is now removed. So, there may still be other
resources - or else this was not the cause of the error.

== Comment: #28 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-02-25 12:44:35 ==
Jesse ran the f/w debug again, got this:

Failed with the same return code: looks like two page table entries in there for 21010018
                                                                H S                                                    
                                                                V V C R T G B S L H W I M G N E  UT P        PS   SS  K
                                                                a a h e a r l p p             n  pi p        ai   ei  e
                          Vpn                  RealAddr         l l g f g p t V g                dm          gz   gz  y
==RA=0003FF8200000000==================================================================================================
HPTE     80000020FEDA4700 0013D349C0080120 Phy 8003FF8200100000 X   X X         X     X X X X   000 NAU     64K   1T 00
HPTE     80000020FEDA4D00 0013D349C0080060 Phy 8003FF8200100000 X   X X         X     X X X X   000 NAU     64K   1T 00
=======================================================================================================================
The values in the Vpn column are the virtual page numbers that are still registered.

So, what I found does not appear to have been the HPTEs that are causing
the problem - even though it does appear to be a bug in be2net. Back to
hunting down these addresses.

== Comment: #37 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-03-08 07:56:14 ==
The fix is now in kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2

== Comment: #39 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-03-30 15:00:56 ==
I'm not sure what the correct state is. I think I saw notes on another bugzilla asking Canonical to update 15.10, so I wonder what this bug is for. Should it be changed to FIXED awaiting a new kernel from Canonical?

== Comment: #42 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-05-26 16:27:49 ==
This needs to be mirrored to Canonical so they can pull the commit from kernel.org.

== Comment: #43 - Douglas Miller <dougmill@xxxxxxxxxx> - 2016-05-26 16:29:10 ==
 kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2 needs to be pulled into Ubuntu 16.04.1

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bot-comment bugnameltc-133845 severity-critical targetmilestone-inin1604
-- 
drmgr failed to remove i/o slot
https://bugs.launchpad.net/bugs/1587295
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.