kernel-packages team mailing list archive

[Bug 1533351] Re: DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS

 

** Changed in: linux-lts-utopic (Ubuntu)
     Assignee: Taco Screen team (taco-screen-team) => (unassigned)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-lts-utopic in Ubuntu.
https://bugs.launchpad.net/bugs/1533351

Title:
  DLPAR operation fails on Bell adapter with Ubuntu 14.04.3 OS

Status in linux-lts-utopic package in Ubuntu:
  Invalid

Bug description:
  == Comment: #0 - HARSHA THYAGARAJA <hathyaga@xxxxxxxxxx> - 2015-11-06 04:10:32 ==
  ---Problem Description---
  DLPAR operation fails on Bell adapter 
   
  Contact Information = hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx 
   
  ---uname output---
  Linux tuletapio1-lp5 3.13.0-67-generic #110-Ubuntu SMP Fri Oct 23 13:24:51 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
   
  Machine Type = 8286-41A 
   
  ---Steps to Reproduce---

   Necessary packages installed are:
  devices.chrp.base.servicerm_2.5.0.1-15111_ppc64el.deb
  dynamicrm_2.0.1-3_ppc64el.deb
  rsct.core_3.2.0.6-15111_ppc64el.deb
  rsct.core.utils_3.2.0.6-15111_ppc64el.deb
  src_3.2.0.6-15111_ppc64el.deb

  On the OS:
   root@tuletapio1-lp5:~# startsrc -g rsct
  0513-059 The ctcas Subsystem has been started. Subsystem PID is 1382.
  0513-029 The ctrmc Subsystem is already active.
  Multiple instances are not supported.


  root@tuletapio1-lp5:~# startsrc -g rsct_rm
  0513-029 The IBM.MgmtDomainRM Subsystem is already active.
  Multiple instances are not supported.
  0513-059 The IBM.ERRM Subsystem has been started. Subsystem PID is 1389.
  0513-029 The IBM.HostRM Subsystem is already active.
  Multiple instances are not supported.
  0513-059 The IBM.AuditRM Subsystem has been started. Subsystem PID is 1390.
  0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1393.
  0513-029 The IBM.DRM Subsystem is already active.
  Multiple instances are not supported.
  0513-029 The IBM.ServiceRM Subsystem is already active.
  Multiple instances are not supported.

  
   
  root@tuletapio1-lp5:~# lssrc -a
  Subsystem         Group            PID     Status 
   ctrmc            rsct             921     active
   IBM.DRM          rsct_rm          1025    active
   IBM.MgmtDomainRM rsct_rm          1130    active
   IBM.HostRM       rsct_rm          1143    active
   IBM.ServiceRM    rsct_rm          1183    active
   ctcas            rsct             1382    active
   IBM.ERRM         rsct_rm          1389    active
   IBM.AuditRM      rsct_rm          1390    active
   IBM.SensorRM     rsct_rm          1393    active


  In the HMC:
  Run the command: 
  hscroot@pwrio-hmc:~> lshwres -r io -m tuletapio1-fsp --rsubtype slot --filter "lpar_names=tuletapio1-lp5-iranna"
  unit_phys_loc=U78C9.001.WZS00CH,bus_id=24,phys_loc=C6,drc_index=21010018,lpar_name=tuletapio1-lp5-iranna,lpar_id=5,slot_io_pool_id=none,description=Quad Async EIA-232 PCI-Express Adapter,feature_codes=none,pci_vendor_id=114F,pci_device_id=00B6,pci_subs_vendor_id=114F,pci_subs_device_id=00B6,pci_class=0000,pci_revision_id=AA,bus_grouping=0,iop=0,parent_slot_drc_index=none,drc_name=U78C9.001.WZS00CH-P1-C6,interposer_present=0,interposer_pcie=0,lpar_assignment_capable=1,dynamic_lpar_assignment_capable=1

  
  hscroot@pwrio-hmc:~> chhwres -r io -m tuletapio1-fsp -o r --id 5 -l 21010018


  HSCL2929 The dynamic removal of I/O resources failed: The I/O slot dynamic partitioning operation failed.  Here are the I/O slot IDs that failed and the reasons for failure:
     
  Validating PHB DLPAR capability...yes.
  failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
  failed to disable hotplug children
  kernel remove failed for PHB 24, rc = -1

  
  Observed in the terminal:

  Nov  4 05:26:43 tuletapio1-lp5 kernel: [  553.125671] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:44 tuletapio1-lp5 kernel: [  554.125766] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:45 tuletapio1-lp5 kernel: [  555.125862] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:46 tuletapio1-lp5 kernel: [  556.125957] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:47 tuletapio1-lp5 kernel: [  557.126052] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:48 tuletapio1-lp5 kernel: [  558.126148] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:49 tuletapio1-lp5 kernel: [  559.126243] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:50 tuletapio1-lp5 kernel: [  560.126338] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:51 tuletapio1-lp5 kernel: [  561.126432] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:52 tuletapio1-lp5 kernel: [  562.126527] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:53 tuletapio1-lp5 kernel: [  563.126622] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:54 tuletapio1-lp5 kernel: [  564.126717] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:55 tuletapio1-lp5 kernel: [  565.126813] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:56 tuletapio1-lp5 kernel: [  566.126908] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:57 tuletapio1-lp5 kernel: [  567.127004] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:58 tuletapio1-lp5 kernel: [  568.127099] rpadlpar_io: slot PHB 24 removed
  Nov  4 05:26:59 tuletapio1-lp5 kernel: [  569.127193] rpadlpar_io: slot PHB 24 removed

  The kernel log repeats the message above continuously, reporting that
  the slot has been removed, but lspci -nn still showed the entry for
  the adapter:

  root@tuletapio1-lp5:~# lspci -nn
  01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge [10b5:8112] (rev ff)
  02:00.0 Serial controller [0700]: Digi International Digi Neo 4 (IBM version) [114f:00f4] (rev ff)
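
  The check described above (is the adapter still listed after the
  remove?) can be scripted. The following is a minimal sketch, assuming
  the `lspci -nn` output format shown in this report; the helper names
  pci_ids and still_present are hypothetical and not part of any tool
  mentioned here.

```python
import re

# Matches the "[vendor:device]" pair printed by `lspci -nn`.
# The class code, e.g. "[0604]", has no colon and is therefore skipped.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")

# The two lines quoted in the bug report.
SAMPLE = """\
01:00.0 PCI bridge [0604]: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge [10b5:8112] (rev ff)
02:00.0 Serial controller [0700]: Digi International Digi Neo 4 (IBM version) [114f:00f4] (rev ff)
"""


def pci_ids(lspci_output):
    """Map each device address to its (vendor, device) ID pair."""
    ids = {}
    for line in lspci_output.splitlines():
        line = line.strip()
        if not line:
            continue
        address = line.split()[0]
        matches = PCI_ID.findall(line.lower())
        if matches:
            ids[address] = matches[-1]
    return ids


def still_present(lspci_output, vendor, device):
    """True if a device with the given IDs is still listed."""
    return (vendor.lower(), device.lower()) in pci_ids(lspci_output).values()


print(still_present(SAMPLE, "114f", "00f4"))
```

  On a live system the SAMPLE text would be replaced by the captured
  output of `lspci -nn` taken after the chhwres remove attempt.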

  
  Details of the system:
  IP: 9.40.192.64
  creds: root/ltcnetdd

   
  *Additional Instructions for hathyaga@xxxxxxxxxx, iranna.ankad@xxxxxxxxxx: 
  -Post a private note with access information to the machine that the bug is occurring on.

  == Comment: #1 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-11-20 11:10:57 ==
  Interesting. Looking at the drmgr logs, the DLPAR remove of the PHB is not failing because of an error; drmgr is timing out before it can complete the request. I am building the latest upstream code and will look into why the request is timing out. It should be able to complete within the five-minute timeout given.

  ########## Nov 04 05:09:32 2015 ##########
  drmgr: -r -c phb -s PHB 24 -w 5 -d 1 
  Validating PHB DLPAR capability...yes.
  Getting node types 0x00000010

  DR nodes list
  ==============
  /proc/device-tree/pci@800000020000018: 
          drc index: 0x20000018        description: Unknown slot type
          drc name: PHB 24
          loc code: U78C9.001.WZS00CH-P1
  /proc/device-tree/pci@800000020000018: 
          drc index: 0x22010018        description: PCI-E capable, Rev 3, 16x lanes with 16x lanes connected
          drc name: U78C9.001.WZS00CH-P1-C6
          loc code: U78C9.001.WZS00CH-P1

  Retrieving hotplug nodes
  Could not find DRC property group in path: /proc/device-tree/pci@800000020000018/pci@0.
  setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS00CH-P1-C6
  failed to open /sys/bus/pci/slots/U78C9.001.WZS00CH-P1-C6/power: No such file or directory
  failed to disable hotplug children
  Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0/serial@0
  Removing device-tree node /proc/device-tree/pci@800000020000018/pci@0
  HPDEV: /sys/bus/pci/devices/0000:01:00.0
         /pci@800000020000018/pci@0
  HPDEV: /sys/bus/pci/devices/0000:02:00.0
         /pci@800000020000018/pci@0/serial@0
  performing kernel op for PHB 24, file is /sys/bus/pci/slots/control/remove_slot
  Drmgr has exceeded its specified wait time and will not continue
  kernel remove failed for PHB 24, rc = -1
  ########## Nov 04 05:14:32 2015 ##########
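
  The timeout reading of this log can be checked from its two timestamp
  banners. The sketch below is illustrative only: it assumes the banner
  format shown above and takes the "-w 5" argument as a wait time in
  minutes, as the comment describes. The function name banner_time is
  hypothetical.

```python
from datetime import datetime


def banner_time(line):
    """Parse a drmgr log banner like '########## Nov 04 05:09:32 2015 ##########'."""
    text = line.strip().strip("#").strip()
    return datetime.strptime(text, "%b %d %H:%M:%S %Y")


start = banner_time("########## Nov 04 05:09:32 2015 ##########")
end = banner_time("########## Nov 04 05:14:32 2015 ##########")

elapsed = (end - start).total_seconds()
wait_minutes = 5  # from "-w 5" on the logged drmgr command line

# The request ran for the whole wait period, which matches a timeout
# rather than an immediate error.
timed_out = elapsed >= wait_minutes * 60
print(elapsed, timed_out)  # 300.0 True
```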

  == Comment: #3 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-01 11:08:57 ==
  When looking at this bz prior to the Thanksgiving break I noticed that the hotplug slots under this PHB are not getting registered by the rpadlpar_io kernel module (this is where we handle PCI hotplug on Power). As a result they are not removed when we go to remove the PHB, which leads to the scenario we are seeing: the continuous output of the "slot PHB 24 removed" message.

  Can anyone comment on whether this issue is seen on any other systems
  or on any other distros?
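
  To compare systems, one way to see whether a slot has been registered
  is to look for its directory under /sys/bus/pci/slots. This is a rough
  sketch, assuming that a registered hotplug slot appears there as a
  directory containing a "power" file, as the path in the drmgr error
  message suggests. The function names are hypothetical.

```python
import os


def registered_slots(slots_dir="/sys/bus/pci/slots"):
    """Return the slot names under slots_dir that expose a 'power' file."""
    if not os.path.isdir(slots_dir):
        return []
    return sorted(
        name
        for name in os.listdir(slots_dir)
        if os.path.isfile(os.path.join(slots_dir, name, "power"))
    )


def slot_is_registered(drc_name, slots_dir="/sys/bus/pci/slots"):
    """True if the slot with this DRC name has a 'power' file."""
    return drc_name in registered_slots(slots_dir)


if __name__ == "__main__":
    # DRC name of the failing slot, taken from the drmgr log in this report.
    print(slot_is_registered("U78C9.001.WZS00CH-P1-C6"))
```

  The "control" directory used for add_slot and remove_slot has no
  "power" file, so this check does not count it as a slot.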

  == Comment: #4 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2015-12-08 12:09:13 ==
  Updates from further investigation into this issue.

  This does not appear to be a drmgr issue. I was able to boot a 4.2
  kernel on the system and then add and remove the adapter without any
  problems.

  It appears the reason the dlpar add of the adapter is failing is
  because the device tree gets set up wrong. In the process of adding
  the adapter the first update to the device tree is to add the
  interrupt controller for the PHB, afterwards we add the PHB itself.
  When the PHB is added the kernel is putting the PHB under the
  interrupt controller instead of in the root of the device tree where
  it belongs. This causes the drmgr command to report a failure because
  it cannot find the PHB after adding it to the device tree: the PHB
  should not be under the interrupt controller, and we do not look for
  it there.

  As mentioned above, the same drmgr command fails on the stock kernel
  and works on a 4.2 kernel. Next step is to determine why the PHB is
  being put under the interrupt controller instead of the root node.
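
  The misplacement described above can be reproduced in a toy model.
  This is not the kernel code. It only assumes that a root lookup which
  returns the head of an "all nodes" list goes wrong once a dynamic
  update puts a new node at the head of that list. Node, Tree, dlpar_add
  and the "interrupt-controller" node name are all hypothetical; the PHB
  node name is taken from the drmgr log.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def path(self):
        """Full path of this node, built by walking up to the root."""
        if self.parent is None:
            return "/"
        return self.parent.path().rstrip("/") + "/" + self.name


class Tree:
    def __init__(self):
        self.root = Node("")
        self.all_nodes = [self.root]  # the head plays the role of "all nodes"

    def attach(self, name, parent):
        node = Node(name, parent)
        self.all_nodes.insert(0, node)  # assumption: new nodes go to the head
        return node

    def find_root_buggy(self):
        return self.all_nodes[0]  # wrongly assumes the head is the root

    def find_root_fixed(self):
        return self.root  # always the real root


def dlpar_add(tree, find_root):
    """Add the interrupt controller first, then the PHB, both at '/'."""
    tree.attach("interrupt-controller", find_root())
    return tree.attach("pci@800000020000018", find_root())


buggy = Tree()
print(dlpar_add(buggy, buggy.find_root_buggy).path())
# /interrupt-controller/pci@800000020000018

fixed = Tree()
print(dlpar_add(fixed, fixed.find_root_fixed).path())
# /pci@800000020000018
```

  In the buggy variant the second lookup returns the node attached just
  before it, so the PHB lands under the interrupt controller, where
  drmgr does not look for it.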

  == Comment: #5 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 11:58:18 ==
  The fix for this issue is already upstream in commit 99de64984c3a7c9bf56a50e6dcc51006c9485620

  OF: fix of_find_node_by_path() assumption that of_allnodes is root
  of_find_node_by_path() is borked because of_allnodes is not guaranteed to
  contain the root of the tree after using any of the dynamic update functions
  because some other nodes ends up as of_allnodes.

  Fixes: c22e650e66b8 of: Make of_find_node_by_path() handle /aliases
  Reported-by: pantelis.antoniou@xxxxxxxxxxxx
  Signed-off-by: Frank Rowand <frank.rowand@xxxxxxxxxxxxxx>
  Signed-off-by: Rob Herring <robh@xxxxxxxxxx>

  Attached is a backport of the patch.

  == Comment: #7 - Nathan D. Fontenot <nfonteno@xxxxxxxxxx> - 2016-01-12 12:13:10 ==
  This patch is needed to avoid breaking DLPAR capabilities on the Power platforms.

  Without this patch, the DLPAR capability of Power platforms to add
  devices is broken.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-utopic/+bug/1533351/+subscriptions