kernel-packages team mailing list archive

[Bug 1533351] [NEW] DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS

 

You have been subscribed to a public bug:

== Comment: #0 - HARSHA THYAGARAJA <hathyaga@xxxxxxxxxx> - 2015-11-06 04:10:32 ==
---Problem Description---
DLPAR operation fails on Bell adapter 
 
Contact Information = hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx 
 
---uname output---
Linux tuletapio1-lp5 3.13.0-67-generic #110-Ubuntu SMP Fri Oct 23 13:24:51 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = 8286-41A 
 
---Steps to Reproduce---

 Necessary packages installed are:
devices.chrp.base.servicerm_2.5.0.1-15111_ppc64el.deb
dynamicrm_2.0.1-3_ppc64el.deb
rsct.core_3.2.0.6-15111_ppc64el.deb
rsct.core.utils_3.2.0.6-15111_ppc64el.deb
src_3.2.0.6-15111_ppc64el.deb

On the OS:
 root@tuletapio1-lp5:~# startsrc -g rsct
0513-059 The ctcas Subsystem has been started. Subsystem PID is 1382.
0513-029 The ctrmc Subsystem is already active.
Multiple instances are not supported.


root@tuletapio1-lp5:~# startsrc -g rsct_rm
0513-029 The IBM.MgmtDomainRM Subsystem is already active.
Multiple instances are not supported.
0513-059 The IBM.ERRM Subsystem has been started. Subsystem PID is 1389.
0513-029 The IBM.HostRM Subsystem is already active.
Multiple instances are not supported.
0513-059 The IBM.AuditRM Subsystem has been started. Subsystem PID is 1390.
0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1393.
0513-029 The IBM.DRM Subsystem is already active.
Multiple instances are not supported.
0513-029 The IBM.ServiceRM Subsystem is already active.
Multiple instances are not supported.


 
root@tuletapio1-lp5:~# lssrc -a
Subsystem         Group            PID     Status 
 ctrmc            rsct             921     active
 IBM.DRM          rsct_rm          1025    active
 IBM.MgmtDomainRM rsct_rm          1130    active
 IBM.HostRM       rsct_rm          1143    active
 IBM.ServiceRM    rsct_rm          1183    active
 ctcas            rsct             1382    active
 IBM.ERRM         rsct_rm          1389    active
 IBM.AuditRM      rsct_rm          1390    active
 IBM.SensorRM     rsct_rm          1393    active


In the HMC:
Run the command: 
hscroot@pwrio-hmc:~> lshwres -r io -m tuletapio1-fsp --rsubtype slot --filter "lpar_names=tuletapio1-lp5-iranna"
unit_phys_loc=U78C9.001.WZS00CH,bus_id=24,phys_loc=C6,drc_index=21010018,lpar_name=tuletapio1-lp5-iranna,lpar_id=5,slot_io_pool_id=none,description=Quad Async EIA-232 PCI-Express Adapter,feature_codes=none,pci_vendor_id=114F,pci_device_id=00B6,pci_subs_vendor_id=114F,pci_subs_device_id=00B6,pci_class=0000,pci_revision_id=AA,bus_grouping=0,iop=0,parent_slot_drc_index=none,drc_name=U78C9.001.WZS00CH-P1-C6,interposer_present=0,interposer_pcie=0,lpar_assignment_capable=1,dynamic_lpar_assignment_capable=1


hscroot@pwrio-hmc:~> chhwres -r io -m tuletapio1-fsp -o r --id 5 -l 21010018


HSCL2929 The dynamic removal of I/O resources failed: The I/O slot dynamic partitioning operation failed.  Here are the I/O slot IDs that failed and the reasons for failure:
   
Validating PHB DLPAR capability...yes.
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
kernel remove failed for PHB 24, rc = -1


Observed in the terminal:

Nov  4 05:26:43 tuletapio1-lp5 kernel: [  553.125671] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:44 tuletapio1-lp5 kernel: [  554.125766] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:45 tuletapio1-lp5 kernel: [  555.125862] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:46 tuletapio1-lp5 kernel: [  556.125957] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:47 tuletapio1-lp5 kernel: [  557.126052] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:48 tuletapio1-lp5 kernel: [  558.126148] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:49 tuletapio1-lp5 kernel: [  559.126243] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:50 tuletapio1-lp5 kernel: [  560.126338] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:51 tuletapio1-lp5 kernel: [  561.126432] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:52 tuletapio1-lp5 kernel: [  562.126527] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:53 tuletapio1-lp5 kernel: [  563.126622] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:54 tuletapio1-lp5 kernel: [  564.126717] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:55 tuletapio1-lp5 kernel: [  565.126813] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:56 tuletapio1-lp5 kernel: [  566.126908] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:57 tuletapio1-lp5 kernel: [  567.127004] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:58 tuletapio1-lp5 kernel: [  568.127099] rpadlpar_io: slot PHB 24 removed
Nov  4 05:26:59 tuletapio1-lp5 kernel: [  569.127193] rpadlpar_io: slot PHB 24 removed

The terminal prints the above message continuously, indicating that the
adapter has been removed, but lspci -nn still shows the entry for the adapter:

root@tuletapio1-lp5:~# lspci -nn
01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge [10b5:8112] (rev ff)
02:00.0 Serial controller [0700]: Digi International Digi Neo 4 (IBM version) [114f:00f4] (rev ff)
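
A minimal sketch, in Python, of one way to spot a kernel message that repeats like the one above. The regular expression assumes the syslog layout shown in this report, and the function name and threshold are illustrative choices rather than part of any existing tool.

```python
import re
from collections import Counter

# Matches syslog-style kernel lines such as:
#   Nov  4 05:26:43 host kernel: [  553.125671] rpadlpar_io: slot PHB 24 removed
KERNEL_LINE = re.compile(r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<message>.*)$")


def repeated_kernel_messages(lines, threshold=3):
    """Return {message: count} for kernel messages seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        match = KERNEL_LINE.search(line.rstrip())
        if match:
            counts[match.group("message")] += 1
    return {msg: n for msg, n in counts.items() if n >= threshold}


# Three of the lines quoted in this report.
sample = [
    "Nov  4 05:26:43 tuletapio1-lp5 kernel: [  553.125671] rpadlpar_io: slot PHB 24 removed",
    "Nov  4 05:26:44 tuletapio1-lp5 kernel: [  554.125766] rpadlpar_io: slot PHB 24 removed",
    "Nov  4 05:26:45 tuletapio1-lp5 kernel: [  555.125862] rpadlpar_io: slot PHB 24 removed",
]
print(repeated_kernel_messages(sample))
```

Feeding the function the lines of a saved syslog file would work the same way as the in-memory sample.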


Details of the system:
IP: 9.40.192.64
creds: root/ltcnetdd

 
*Additional Instructions for hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx: 
-Post a private note with access information to the machine that the bug is occurring on.

== Comment: #1 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-11-20 11:10:57 ==
Interesting. Looking at the drmgr logs, the DLPAR remove of the PHB is not failing because of an error but because drmgr is timing out before it is able to complete the request. I am building the latest upstream code and will take a look at why the request is timing out; it should be able to complete within the five-minute timeout given.

########## Nov 04 05:09:32 2015 ##########
drmgr: -r -c phb -s PHB 24 -w 5 -d 1 
Validating PHB DLPAR capability...yes.
Getting node types 0x00000010

DR nodes list
==============
/proc/device-tree/pci@800000020000018: 
        drc index: 0x20000018        description: Unknown slot type
        drc name: PHB 24
        loc code: U78C9.001.WZS00CH-P1
/proc/device-tree/pci@800000020000018: 
        drc index: 0x22010018        description: PCI-E capable, Rev 3, 16x lanes with 16x lanes connected
        drc name: U78C9.001.WZS00CH-P1-C6
        loc code: U78C9.001.WZS00CH-P1

Retrieving hotplug nodes
Could not find DRC property group in path: /proc/device-tree/pci@800000020000018/pci@0.
setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS00CH-P1-C6
failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
failed to disable hotplug children
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0/serial@0
Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:01:00.0
       /pci@800000020000018/pci@0
HPDEV: /sys/bus/pci/devices/0000:02:00.0
       /pci@800000020000018/pci@0/serial@0
performing kernel op for PHB 24, file is /sys/bus/pci/slots/control/remove_slot
Drmgr has exceeded its specified wait time and will not continue
kernel remove failed for PHB 24, rc = -1
########## Nov 04 05:14:32 2015 ##########
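
As a quick check that the two stamps bracketing this log match the wait time passed with -w 5, here is a small Python sketch. The format string is an assumption derived from the two stamp lines shown here, not a documented drmgr log format.

```python
from datetime import datetime

# Assumed layout of the drmgr log stamps, based on the two lines in this log.
STAMP_FORMAT = "########## %b %d %H:%M:%S %Y ##########"


def elapsed_seconds(start_line, end_line):
    """Return the number of seconds between two drmgr log stamp lines."""
    start = datetime.strptime(start_line.strip(), STAMP_FORMAT)
    end = datetime.strptime(end_line.strip(), STAMP_FORMAT)
    return (end - start).total_seconds()


elapsed = elapsed_seconds(
    "########## Nov 04 05:09:32 2015 ##########",
    "########## Nov 04 05:14:32 2015 ##########",
)
print(elapsed / 60)  # minutes between the two stamps
```

The two stamps are exactly five minutes apart, which is consistent with drmgr giving up on its wait time rather than the kernel reporting an error.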

== Comment: #3 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-01 11:08:57 ==
When looking at this bz prior to the Thanksgiving break I noticed that the hotplug slots under this PHB are not getting registered by the rpadlpar_io kernel module (this is where we handle PCI hotplug on Power). As a result they are not removed when we go to remove the PHB, which produces the scenario we are seeing: the continuous output of the "slot PHB 24 removed" message.

Can anyone comment on whether this issue is seen on any other systems or
on any other distros?

== Comment: #4 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-08 12:09:13 ==
Updates from further investigation into this issue.

This does not appear to be a drmgr issue. I was able to boot a 4.2
kernel on the system and then add and remove the adapter without any
problems.

It appears the reason the DLPAR add of the adapter is failing is that
the device tree gets set up wrong. In the process of adding the adapter,
the first update to the device tree adds the interrupt controller for
the PHB; afterwards we add the PHB itself. When the PHB is added, the
kernel puts it under the interrupt controller instead of in the root of
the device tree where it belongs. This causes the drmgr command to think
a failure has occurred because it cannot find the PHB after adding it to
the device tree: the PHB should not be under the interrupt controller,
and we do not look for it there.

As mentioned above, the same drmgr command fails on the stock kernel and
works on a 4.2 kernel. Next step is to determine why the PHB is being
put under the interrupt controller instead of the root node.

== Comment: #5 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 11:58:18 ==
The fix for this issue is already upstream in commit 99de64984c3a7c9bf56a50e6dcc51006c9485620

OF: fix of_find_node_by_path() assumption that of_allnodes is root
of_find_node_by_path() is borked because of_allnodes is not guaranteed to
contain the root of the tree after using any of the dynamic update functions
because some other nodes ends up as of_allnodes.

Fixes: c22e650e66b8 of: Make of_find_node_by_path() handle /aliases
Reported-by: pantelis.antoniou@xxxxxxxxxxxx
Signed-off-by: Frank Rowand <frank.rowand@xxxxxxxxxxxxxx>
Signed-off-by: Rob Herring <robh@xxxxxxxxxx>

Attached is a backport of the patch.
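
To illustrate the failure mode the commit message describes, here is a toy Python model. It is not the kernel code and not the backported patch: the classes, the "list head" behaviour on attach, and the node name interrupt-controller@0 are all assumptions made for the sketch. Only the PHB node name is taken from the drmgr log earlier in this report.

```python
class Node:
    """Toy device-tree node; not the kernel's struct device_node."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def path(self):
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name


class Tree:
    def __init__(self):
        self.root = Node("")
        # Head of the "all nodes" list; it starts out as the root.
        self.allnodes_head = self.root

    def attach(self, name, parent):
        node = Node(name, parent)
        # Assumption for this model: a dynamic attach makes the new node the head.
        self.allnodes_head = node
        return node

    def find_root_buggy(self):
        # Assumes the list head is always the root of the tree.
        return self.allnodes_head

    def find_root_fixed(self):
        # Walks parent links, so the result does not depend on the list head.
        node = self.allnodes_head
        while node.parent is not None:
            node = node.parent
        return node


def add_adapter(find_root_name):
    """Add the interrupt controller, then the PHB, each 'under the root'."""
    tree = Tree()
    find_root = getattr(tree, find_root_name)
    tree.attach("interrupt-controller@0", find_root())  # hypothetical node name
    return tree.attach("pci@800000020000018", find_root())


phb_buggy = add_adapter("find_root_buggy")
phb_fixed = add_adapter("find_root_fixed")
print(phb_buggy.path())
print(phb_fixed.path())
```

In the buggy variant the PHB ends up beneath the interrupt controller, matching the symptom described in comment #4, while the variant that does not trust the list head places it directly under the root.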

== Comment: #7 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 12:13:10 ==
This patch is needed to avoid breaking DLPAR capabilities on the Power platforms.

Without this patch, the DLPAR capability of Power platforms to add
devices is broken.

** Affects: linux-lts-utopic (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-132852 severity-critical targetmilestone-inin14044
-- 
DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS
https://bugs.launchpad.net/bugs/1533351
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-lts-utopic in Ubuntu.