kernel-packages team mailing list archive
Message #155581
[Bug 1533351] Re: DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS
** Changed in: linux-lts-utopic (Ubuntu)
Status: New => Invalid
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-lts-utopic in Ubuntu.
https://bugs.launchpad.net/bugs/1533351
Title:
DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS
Status in linux-lts-utopic package in Ubuntu:
Invalid
Bug description:
== Comment: #0 - HARSHA THYAGARAJA <hathyaga@xxxxxxxxxx> - 2015-11-06 04:10:32 ==
---Problem Description---
DLPAR operation fails on Bell adapter
Contact Information = hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx
---uname output---
Linux tuletapio1-lp5 3.13.0-67-generic #110-Ubuntu SMP Fri Oct 23 13:24:51 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = 8286-41A
---Steps to Reproduce---
Necessary packages installed are:
devices.chrp.base.servicerm_2.5.0.1-15111_ppc64el.deb
dynamicrm_2.0.1-3_ppc64el.deb
rsct.core_3.2.0.6-15111_ppc64el.deb
rsct.core.utils_3.2.0.6-15111_ppc64el.deb
src_3.2.0.6-15111_ppc64el.deb
On the OS:
root@tuletapio1-lp5:~# startsrc -g rsct
0513-059 The ctcas Subsystem has been started. Subsystem PID is 1382.
0513-029 The ctrmc Subsystem is already active.
Multiple instances are not supported.
root@tuletapio1-lp5:~# startsrc -g rsct_rm
0513-029 The IBM.MgmtDomainRM Subsystem is already active.
Multiple instances are not supported.
0513-059 The IBM.ERRM Subsystem has been started. Subsystem PID is 1389.
0513-029 The IBM.HostRM Subsystem is already active.
Multiple instances are not supported.
0513-059 The IBM.AuditRM Subsystem has been started. Subsystem PID is 1390.
0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1393.
0513-029 The IBM.DRM Subsystem is already active.
Multiple instances are not supported.
0513-029 The IBM.ServiceRM Subsystem is already active.
Multiple instances are not supported.
root@tuletapio1-lp5:~# lssrc -a
Subsystem Group PID Status
ctrmc rsct 921 active
IBM.DRM rsct_rm 1025 active
IBM.MgmtDomainRM rsct_rm 1130 active
IBM.HostRM rsct_rm 1143 active
IBM.ServiceRM rsct_rm 1183 active
ctcas rsct 1382 active
IBM.ERRM rsct_rm 1389 active
IBM.AuditRM rsct_rm 1390 active
IBM.SensorRM rsct_rm 1393 active
In the HMC:
Run the command:
hscroot@pwrio-hmc:~> lshwres -r io -m tuletapio1-fsp --rsubtype slot --filter "lpar_names=tuletapio1-lp5-iranna"
unit_phys_loc=U78C9.001.WZS00CH,bus_id=24,phys_loc=C6,drc_index=21010018,lpar_name=tuletapio1-lp5-iranna,lpar_id=5,slot_io_pool_id=none,description=Quad Async EIA-232 PCI-Express Adapter,feature_codes=none,pci_vendor_id=114F,pci_device_id=00B6,pci_subs_vendor_id=114F,pci_subs_device_id=00B6,pci_class=0000,pci_revision_id=AA,bus_grouping=0,iop=0,parent_slot_drc_index=none,drc_name=U78C9.001.WZS00CH-P1-C6,interposer_present=0,interposer_pcie=0,lpar_assignment_capable=1,dynamic_lpar_assignment_capable=1
hscroot@pwrio-hmc:~> chhwres -r io -m tuletapio1-fsp -o r --id 5 -l 21010018
HSCL2929 The dynamic removal of I/O resources failed: The I/O slot dynamic partitioning operation failed. Here are the I/O slot IDs that failed and the reasons for failure:
Validating PHB DLPAR capability...yes.
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
kernel remove failed for PHB 24, rc = -1
Observed in the terminal:
Nov 4 05:26:43 tuletapio1-lp5 kernel: [ 553.125671] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:44 tuletapio1-lp5 kernel: [ 554.125766] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:45 tuletapio1-lp5 kernel: [ 555.125862] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:46 tuletapio1-lp5 kernel: [ 556.125957] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:47 tuletapio1-lp5 kernel: [ 557.126052] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:48 tuletapio1-lp5 kernel: [ 558.126148] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:49 tuletapio1-lp5 kernel: [ 559.126243] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:50 tuletapio1-lp5 kernel: [ 560.126338] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:51 tuletapio1-lp5 kernel: [ 561.126432] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:52 tuletapio1-lp5 kernel: [ 562.126527] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:53 tuletapio1-lp5 kernel: [ 563.126622] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:54 tuletapio1-lp5 kernel: [ 564.126717] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:55 tuletapio1-lp5 kernel: [ 565.126813] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:56 tuletapio1-lp5 kernel: [ 566.126908] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:57 tuletapio1-lp5 kernel: [ 567.127004] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:58 tuletapio1-lp5 kernel: [ 568.127099] rpadlpar_io: slot PHB 24 removed
Nov 4 05:26:59 tuletapio1-lp5 kernel: [ 569.127193] rpadlpar_io: slot PHB 24 removed
The terminal continuously prints the messages above, indicating that
the adapter has been removed, but lspci -nn still shows the entry for
the adapter:
root@tuletapio1-lp5:~# lspci -nn
01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge [10b5:8112] (rev ff)
02:00.0 Serial controller [0700]: Digi International Digi Neo 4 (IBM version) [114f:00f4] (rev ff)
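The repetition in the kernel log quoted above can be quantified with a short script. The following is only a sketch in Python: the function name and the regular expression are assumptions derived from the log lines shown in this report, not part of drmgr, rpadlpar_io, or any other tool mentioned here.

```python
import re

# Pattern assumed from the log lines quoted in this report:
# a bracketed uptime stamp followed by the rpadlpar_io removal message.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+rpadlpar_io: slot (.+) removed")


def summarize_removals(lines):
    """Return (slot, count, mean_interval_seconds) for removal messages."""
    stamps = []
    slot = None
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
            slot = match.group(2)
    if len(stamps) < 2:
        return slot, len(stamps), None
    intervals = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return slot, len(stamps), sum(intervals) / len(intervals)


# First three lines of the log from this report.
sample = [
    "Nov 4 05:26:43 tuletapio1-lp5 kernel: [ 553.125671] rpadlpar_io: slot PHB 24 removed",
    "Nov 4 05:26:44 tuletapio1-lp5 kernel: [ 554.125766] rpadlpar_io: slot PHB 24 removed",
    "Nov 4 05:26:45 tuletapio1-lp5 kernel: [ 555.125862] rpadlpar_io: slot PHB 24 removed",
]
slot, count, mean_interval = summarize_removals(sample)
print(slot, count, mean_interval)
```

On the sample lines the mean interval is very close to one second, which suggests the same removal message is being emitted roughly once per second while lspci still lists the device.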
Details of the system:
IP: 9.40.192.64
creds: root/ltcnetdd
*Additional Instructions for hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx:
-Post a private note with access information to the machine that the bug is occurring on.
== Comment: #1 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-11-20 11:10:57 ==
Interesting. Looking at the drmgr logs, the DLPAR remove of the PHB is not failing because of an error; drmgr is timing out before it is able to complete the request. I am building the latest upstream code and will look into why the request is timing out. It should be able to complete within the five-minute timeout given.
########## Nov 04 05:09:32 2015 ##########
drmgr: -r -c phb -s PHB 24 -w 5 -d 1
Validating PHB DLPAR capability...yes.
Getting node types 0x00000010
DR nodes list
==============
/proc/device-tree/pci@800000020000018:
drc index: 0x20000018 description: Unknown slot type
drc name: PHB 24
loc code: U78C9.001.WZS00CH-P1
/proc/device-tree/pci@800000020000018:
drc index: 0x22010018 description: PCI-E capable, Rev 3, 16x lanes with 16x lanes connected
drc name: U78C9.001.WZS00CH-P1-C6
loc code: U78C9.001.WZS00CH-P1
Retrieving hotplug nodes
Could not find DRC property group in path: /proc/device-tree/pci@800000020000018/pci@0.
setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS00CH-P1-C6
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0/serial@0
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:01:00.0
/pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:02:00.0
/pci@800000020000018/pci@0/serial@0
performing kernel op for PHB 24, file is /sys/bus/pci/slots/control/remove_slot
Drmgr has exceeded its specified wait time and will not continue
kernel remove failed for PHB 24, rc = -1
########## Nov 04 05:14:32 2015 ##########
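The two timestamp banners that bracket this log can be used to confirm the timeout. The sketch below, in Python, assumes the banner format shown here and assumes that the `-w 5` argument is a wait time in minutes, as the comment above implies; the helper names are hypothetical.

```python
from datetime import datetime

# Banner format assumed from the two banners in this log,
# e.g. "########## Nov 04 05:09:32 2015 ##########".
BANNER_FORMAT = "%b %d %H:%M:%S %Y"


def parse_banner(line):
    """Strip the '#' padding and parse the timestamp inside a banner line."""
    return datetime.strptime(line.strip("# \n"), BANNER_FORMAT)


def elapsed_seconds(start_banner, end_banner):
    """Seconds elapsed between two banner lines."""
    delta = parse_banner(end_banner) - parse_banner(start_banner)
    return delta.total_seconds()


start = "########## Nov 04 05:09:32 2015 ##########"
end = "########## Nov 04 05:14:32 2015 ##########"
print(elapsed_seconds(start, end) / 60)  # prints 5.0
```

The five minutes between the banners match the wait time passed to drmgr, which is consistent with the request timing out rather than returning an error.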
== Comment: #3 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-01 11:08:57 ==
When looking at this bz prior to the Thanksgiving break, I noticed that the hotplug slots under this PHB are not getting registered by the rpadlpar_io kernel module (this is where we handle PCI hotplug on Power). As a result, they are not removed when we go to remove the PHB, which produces the scenario we are seeing: the continuous output of the "PHB 24 removed" message.
Can anyone comment on whether this issue is seen on any other systems
or on any other distros?
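To check on another system whether a hotplug slot is registered, one can look for the sysfs entry that drmgr failed to open in the log. The following is a minimal sketch in Python; the function names are hypothetical, and treating the presence of the `power` file as evidence of registration is an assumption based on the error message in this report.

```python
from pathlib import Path

# Directory taken from the error message in this report.
SYSFS_SLOTS = Path("/sys/bus/pci/slots")


def slot_power_path(drc_name, slots_dir=SYSFS_SLOTS):
    """Path of the 'power' file that drmgr tried to open for a slot."""
    return Path(slots_dir) / drc_name / "power"


def is_slot_registered(drc_name, slots_dir=SYSFS_SLOTS):
    """True if the slot's 'power' file exists (assumed sign of registration)."""
    return slot_power_path(drc_name, slots_dir).exists()


def registered_slots(slots_dir=SYSFS_SLOTS):
    """Names of all slot directories, or an empty list if the directory is absent."""
    directory = Path(slots_dir)
    if not directory.is_dir():
        return []
    return sorted(entry.name for entry in directory.iterdir() if entry.is_dir())


drc_name = "U78C9.001.WZS00CH-P1-C6"  # DRC name from this report
print(slot_power_path(drc_name))
print(is_slot_registered(drc_name))
print(registered_slots())
```

On the affected system the second line would be expected to print False, matching the "No such file or directory" error above.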
== Comment: #4 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-08 12:09:13 ==
Updates from further investigation into this issue.
This does not appear to be a drmgr issue. I was able to boot a 4.2
kernel on the system and then add and remove the adapter without any
problems.
It appears the reason the dlpar add of the adapter is failing is
because the device tree gets set up wrong. In the process of adding
the adapter the first update to the device tree is to add the
interrupt controller for the PHB, afterwards we add the PHB itself.
When the PHB is added the kernel is putting the PHB under the
interrupt controller instead of in the root of the device tree where
it belongs. This causes the drmgr command to report a failure because
it cannot find the PHB after adding it to the device tree: the PHB
should not be under the interrupt controller, and drmgr does not look
for it there.
As mentioned above, the same drmgr command fails on the stock kernel
and works on a 4.2 kernel. Next step is to determine why the PHB is
being put under the interrupt controller instead of the root node.
== Comment: #5 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 11:58:18 ==
The fix for this issue is already upstream in commit 99de64984c3a7c9bf56a50e6dcc51006c9485620
OF: fix of_find_node_by_path() assumption that of_allnodes is root
of_find_node_by_path() is borked because of_allnodes is not guaranteed to
contain the root of the tree after using any of the dynamic update functions
because some other nodes ends up as of_allnodes.
Fixes: c22e650e66b8 of: Make of_find_node_by_path() handle /aliases
Reported-by: pantelis.antoniou@xxxxxxxxxxxx
Signed-off-by: Frank Rowand <frank.rowand@xxxxxxxxxxxxxx>
Signed-off-by: Rob Herring <robh@xxxxxxxxxx>
Attached is a backport of the patch.
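The effect described in the commit message can be illustrated with a toy model. The Python sketch below is not the kernel code: the classes, the node name "interrupt-controller", and the rule that a newly attached node becomes the head of the all-nodes list are simplifying assumptions made to mirror the commit message's statement that some other node ends up as of_allnodes.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def path(self):
        """Full path of this node, built by walking up to the root."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


class Tree:
    def __init__(self):
        self.root = Node("")
        self.allnodes = self.root  # head of the "all nodes" list

    def attach(self, parent, name):
        node = Node(name, parent)
        parent.children[name] = node
        self.allnodes = node  # assumption: a new node becomes the list head
        return node

    def find(self, path, assume_head_is_root):
        # The buggy variant starts from the list head, the fixed one from the root.
        node = self.allnodes if assume_head_is_root else self.root
        for part in filter(None, path.split("/")):
            node = node.children.get(part)
            if node is None:
                return None
        return node


def add_phb(assume_head_is_root):
    """Attach an interrupt controller, then a PHB, both intended for the root."""
    tree = Tree()
    tree.attach(tree.find("/", assume_head_is_root), "interrupt-controller")
    phb = tree.attach(tree.find("/", assume_head_is_root), "pci@800000020000018")
    return phb.path()


print(add_phb(True))   # prints /interrupt-controller/pci@800000020000018
print(add_phb(False))  # prints /pci@800000020000018
```

In this model the lookup of "/" returns the most recently attached node when the list head is assumed to be the root, so the PHB lands under the interrupt controller, which matches the misplacement observed in Comment #4.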
== Comment: #7 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 12:13:10 ==
This patch is needed to avoid breaking DLPAR capabilities on the Power platforms.
Without this patch, the ability of Power platforms to add devices via
DLPAR is broken.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1533351/+subscriptions