kernel-packages team mailing list archive

[Bug 1587316] Comment bridged from LTC Bugzilla

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: bugproxy <bugproxy@xxxxxxxxxx>
Date: Tue, 07 Jun 2016 12:30:09 -0000
Reply-to: Bug 1587316 <1587316@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
------- Comment From cdeadmin@xxxxxxxxxx 2016-06-07 08:23 EDT-------
===================================END===================================
==== State: Verify by: cde00 on 31 May 2016 03:43:19 ====

== Comment: #1 - Application Cdeadmin <cdeadmin@xxxxxxxxxx> - 2016-03-21 15:55:11 ==
==== State: Verify by: cde00 on 31 May 2016 04:07:26 ====

==== State: Verify by: byrneadw on 01 June 2016 11:03:58 ====

==== State: Verify by: byrneadw on 01 June 2016 11:07:41 ====

I loaded the test packages and can now run HTX successfully. I no longer see EEH errors, but I still see these "FLOGI failure Status:x3/x103 TMO:x14" errors.
An earlier comment (update #31 from Guilherme) suggests this is normal.

I'm wondering whether this is another event to add to our ignore list. In
addition to the comment from Guilherme, I can see a very similar event
already in our ignore list because of feedback we received on SW315535;
the event in that case was "FLOGI failure Status:x3/x103 TMO:x4". I'm not
sure what the difference between TMO:x4 and TMO:x14 is.

Is it OK to add "FLOGI failure Status:x3/x103 TMO:x14" events to our
ignore list as well, or is more debug required?

root@rcx2c360:/tmp# dmesg -T --level=alert,crit,err
[Wed Jun  1 13:26:03 2016] lpfc 0000:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
[Wed Jun  1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun  1 13:26:03 2016] lpfc 0000:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
[Wed Jun  1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
[Wed Jun  1 13:26:04 2016] lpfc 0000:01:00.1: 1:(0):0100 FLOGI failure Status:x3/x103 TMO:x14

===>> If required, access to system:
1) Telnet rchd08e0.rchland.ibm.com ( login with userid=dlth1025, password=tim2fish )
2) from #1 execute ssh root@rcx2c360 (password is PASSW0RD)

==== State: Verify by: byrneadw on 02 June 2016 17:11:41 ====

Considering that TMO:x4 and TMO:x14 are timeout values, it suggests to me
that this is the same error we hit before with SW315535. The root cause of
SW315535 was the mfg usage of wrap plugs on the Fibre ports for the
purpose of running HTX. It resulted in the FLOGI message because a port
cannot log in to itself.

The TMO values must have changed with Ubuntu 16.04 or new drivers, as you
mentioned above. This is the first system with a Bluefin running
16.04 that we've had. In the past, all our systems with Bluefin were
Habanero boxes running Ubuntu 14.04.03.

In SW315535 Dan Eisenhauer commented:
"That "error" message means the link came up, so I am conjecturing that there is a wrap plug installed. The FLOGI failed messages would be expected in that case since a port cannot login to itself. So, all those messages are expected and indicate that a wrap plug is installed and the adapters are functioning. Those can all be ignored."

I removed the wrap plugs on our Garrison system and was able to boot many times without hitting this error. I think that matches the results of SW315535.
I'll confirm with our HTX guy that we need to continue using these wraps for HTX. If we do need them I can ignore this message per Dan's analysis. I can change the current entry in our ignore list so that the regex doesn't include the TMO value as that might change and catch us out again.
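
As a minimal sketch of such an ignore-list entry (the format of the real ignore list is not shown in this thread, so the names FLOGI_IGNORE and is_ignorable and the group names below are hypothetical), a pattern that matches the message for any TMO value could look like this:

```python
import re

# Hypothetical ignore-list pattern: the TMO value is matched as "any hex
# number" so that a changed timeout does not break the entry again.
FLOGI_IGNORE = re.compile(
    r"FLOGI failure Status:x(?P<status>[0-9a-fA-F]+)/x(?P<reason>[0-9a-fA-F]+) "
    r"TMO:x(?P<tmo>[0-9a-fA-F]+)"
)


def is_ignorable(line: str) -> bool:
    """Return True for a FLOGI failure with Status:x3/x103, whatever the TMO."""
    m = FLOGI_IGNORE.search(line)
    return (
        bool(m)
        and m.group("status").lower() == "3"
        and m.group("reason").lower() == "103"
    )
```

The status pair stays fixed at x3/x103 on purpose, so that a FLOGI failure with a different status would still be reported.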

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1587316

Title:
  STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times
  during boot then disabled SRC BA188002:b0314a_1612.840

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Yakkety:
  Fix Released

Bug description:
  == Comment: #0 - Application Cdeadmin <cdeadmin@xxxxxxxxxx> -
  2016-03-21 15:55:09 ==

  
  == Comment: #1 - Application Cdeadmin <cdeadmin@xxxxxxxxxx> - 2016-03-21 15:55:11 ==
  ==== State: Open by: mlfield on 21 March 2016 14:45:01 ====

  ==========================Automatic entries==========================
  Contact: LittleField, Michael *CONTRACTOR*
  Backup: Thirukumaran V T (Thirukumaran@xxxxxxxxxx), Deepti Umarani (deeptiumarani@xxxxxxxxxx), Brian M. Carpenter(carp@xxxxxxxxxx)

  ===== sys_capture v5.24 === 2016-03-21_14-25-41 ===========

  |
  |    |
  |    System Hardware Information:
  |      NODE /Sys-0/Node-0, U78C7.001.1AQH383-P2
  |         FSP  /Sys-0/Node-0/FSP-0, FSP-2 DD 1.0, U78C7.001.1AQH383-P1-C5
  |            PSI  /Sys-0/Node-0/FSP-0/PSI-0
  |            PSI  /Sys-0/Node-0/FSP-0/PSI-1
  |         MEMBUF /Sys-0/Node-0/Membuf-12, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C11
  |         MEMBUF /Sys-0/Node-0/Membuf-13, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C10
  |         MEMBUF /Sys-0/Node-0/Membuf-14, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C12
  |         MEMBUF /Sys-0/Node-0/Membuf-15, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C13
  |         MEMBUF /Sys-0/Node-0/Membuf-20, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C23
  |         MEMBUF /Sys-0/Node-0/Membuf-21, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C22
  |         MEMBUF /Sys-0/Node-0/Membuf-22, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C24
  |         MEMBUF /Sys-0/Node-0/Membuf-23, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C25
  |         MEMBUF /Sys-0/Node-0/Membuf-28, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C19
  |         MEMBUF /Sys-0/Node-0/Membuf-29, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C18
  |         MEMBUF /Sys-0/Node-0/Membuf-30, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C20
  |         MEMBUF /Sys-0/Node-0/Membuf-31, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C21
  |         MEMBUF /Sys-0/Node-0/Membuf-36, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C31
  |         MEMBUF /Sys-0/Node-0/Membuf-37, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C30
  |         MEMBUF /Sys-0/Node-0/Membuf-38, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C32
  |         MEMBUF /Sys-0/Node-0/Membuf-39, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C33
  |         MEMBUF /Sys-0/Node-0/Membuf-4, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C15
  |         MEMBUF /Sys-0/Node-0/Membuf-44, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C27
  |         MEMBUF /Sys-0/Node-0/Membuf-45, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C26
  |         MEMBUF /Sys-0/Node-0/Membuf-46, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C28
  |         MEMBUF /Sys-0/Node-0/Membuf-47, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C29
  |         MEMBUF /Sys-0/Node-0/Membuf-5, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C14
  |         MEMBUF /Sys-0/Node-0/Membuf-52, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C39
  |         MEMBUF /Sys-0/Node-0/Membuf-53, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C38
  |         MEMBUF /Sys-0/Node-0/Membuf-54, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C40
  |         MEMBUF /Sys-0/Node-0/Membuf-55, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C41
  |         MEMBUF /Sys-0/Node-0/Membuf-6, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C16
  |         MEMBUF /Sys-0/Node-0/Membuf-60, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C35
  |         MEMBUF /Sys-0/Node-0/Membuf-61, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C34
  |         MEMBUF /Sys-0/Node-0/Membuf-62, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C36
  |         MEMBUF /Sys-0/Node-0/Membuf-63, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C37
  |         MEMBUF /Sys-0/Node-0/Membuf-7, CENTAUR EC 2.0, U78C7.001.1AQH383-P2-C17
  |         PROC /Sys-0/Node-0/Proc-0, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
  |            CORE /Sys-0/Node-0/Proc-0/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-0/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-0/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-0/EX-4/Core-0
  |            PCI  /Sys-0/Node-0/Proc-0/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-0/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-0/PCI-2
  |            PSI  /Sys-0/Node-0/Proc-0/PSI-0
  |         PROC /Sys-0/Node-0/Proc-1, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C2
  |            CORE /Sys-0/Node-0/Proc-1/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-1/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-1/EX-4/Core-0
  |            CORE /Sys-0/Node-0/Proc-1/EX-5/Core-0
  |            PCI  /Sys-0/Node-0/Proc-1/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-1/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-1/PCI-2
  |         PROC /Sys-0/Node-0/Proc-2, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
  |            CORE /Sys-0/Node-0/Proc-2/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-2/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-2/EX-4/Core-0
  |            CORE /Sys-0/Node-0/Proc-2/EX-5/Core-0
  |            PCI  /Sys-0/Node-0/Proc-2/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-2/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-2/PCI-2
  |            PSI  /Sys-0/Node-0/Proc-2/PSI-0
  |         PROC /Sys-0/Node-0/Proc-3, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C3
  |            CORE /Sys-0/Node-0/Proc-3/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-3/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-3/EX-4/Core-0
  |            CORE /Sys-0/Node-0/Proc-3/EX-6/Core-0
  |            PCI  /Sys-0/Node-0/Proc-3/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-3/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-3/PCI-2
  |         PROC /Sys-0/Node-0/Proc-4, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
  |            CORE /Sys-0/Node-0/Proc-4/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-4/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-4/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-4/EX-6/Core-0
  |            PCI  /Sys-0/Node-0/Proc-4/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-4/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-4/PCI-2
  |         PROC /Sys-0/Node-0/Proc-5, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C6
  |            CORE /Sys-0/Node-0/Proc-5/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-5/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-5/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-5/EX-4/Core-0
  |            PCI  /Sys-0/Node-0/Proc-5/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-5/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-5/PCI-2
  |         PROC /Sys-0/Node-0/Proc-6, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
  |            CORE /Sys-0/Node-0/Proc-6/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-6/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-6/EX-4/Core-0
  |            CORE /Sys-0/Node-0/Proc-6/EX-5/Core-0
  |            PCI  /Sys-0/Node-0/Proc-6/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-6/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-6/PCI-2
  |         PROC /Sys-0/Node-0/Proc-7, P8/Murano DD 2.1, U78C7.001.1AQH383-P2-C7
  |            CORE /Sys-0/Node-0/Proc-7/EX-12/Core-0
  |            CORE /Sys-0/Node-0/Proc-7/EX-13/Core-0
  |            CORE /Sys-0/Node-0/Proc-7/EX-14/Core-0
  |            CORE /Sys-0/Node-0/Proc-7/EX-6/Core-0
  |            PCI  /Sys-0/Node-0/Proc-7/PCI-0
  |            PCI  /Sys-0/Node-0/Proc-7/PCI-1
  |            PCI  /Sys-0/Node-0/Proc-7/PCI-2
  |
  |    System Hardware Summary:
  |      Configured Proc Cores: 32
  |      Configured IO UNITs:   24
  |      Configured PCIe PHB:   24
  |      Installed Nodes:       1
  |
  |    Hardware InitFile Information:
  |        No tool support for FIRENZE
  |
  |    Hardware (CINI) Frequency Information:
  |        No tool support for FIRENZE
  |
  |    VPD Information:
  |      Backplane VPD:
  |        None found or VPD info is not available.
  |      VPD LID Information:
  |        VPD LID File [/opt/extucode/80e00040.lid]:
  |          VPD Keyword: [LX], Data: [3100050100300040]
  |        VPD LID File [/opt/extucode/80e00041.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300041]
  |        VPD LID File [/opt/extucode/80e00042.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300042]
  |        VPD LID File [/opt/extucode/80e00043.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300043]
  |        VPD LID File [/opt/extucode/80e00044.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300044]
  |        VPD LID File [/opt/extucode/80e00047.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300047]
  |             Format:        0x31   (1)
  |             Enclosure ID:  0x0004 (P8 HV (Tuleta))
  |             Server Type:   0x01   (i/pSeries)
  |             FRU Type:      0x00   (Backplane)
  |             VPD Pass:      0x30   (0)
  |             LID Name:      0x0047 (P8 Alpine xS4U)
  |        VPD LID File [/opt/extucode/80e00050.lid]:
  |          VPD Keyword: [LX], Data: [3100060100300050]
  |        VPD LID File [/opt/extucode/80e00051.lid]:
  |          VPD Keyword: [LX], Data: [3100060100300051]
  |        VPD LID File [/opt/extucode/80e00942.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300942]
  |        VPD LID File [/opt/extucode/80e00944.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300944]
  |        VPD LID File [/opt/extucode/80e00947.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300947]
  |             Format:        0x31   (1)
  |             Enclosure ID:  0x0004 (P8 HV (Tuleta))
  |             Server Type:   0x01   (i/pSeries)
  |             FRU Type:      0x00   (Backplane)
  |             VPD Pass:      0x30   (0)
  |             LID Name:      0x0947 (P8 Alpine Storage/Shark)
  |        VPD LID File [/opt/extucode/80e00ff0.lid]:
  |          VPD Keyword: [LX], Data: [3100040100300FF0]
  |
  |    WARNINGS:
  |      * Informational: This machine has signed firmware (ship image)
  |
  |    ERRL: Attempting to dump error logs using errl...
  |      Dumping all error logs on FSP to file...
  |      ERRL: The FSP stopped responding... skipping
  |
  |    FFDC:
  |      FNM: Attempting connection for basic health check...
  |        TimeSincePhypStarted=82:13:57.539
  |        No failed tasks found.
  |
  |      FNM: Attempting connection for PHYP FFDC...
  |          FNM PHYP FFDC data stored in /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
  |
  |      FipS MyFFDC: Was not attempted.  Reason:[Not requested]
  |
  |      Cronus: Data collection not attempted. (Unable to use Cronus via SSH Tunnel)
  |
  |----- File(s) Created During Capture ------
  |    SysCapture Primary LogFile: /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537
  |    FNM PHYP FFDC stored in:    /fspmount/alpine/alp7fp1/b0314a_1612.840/fsp/PHYP.FFDC.20160321142537.phyp
  |
  ============== end of capture ==============

  ============================Manual entries===========================
  Title: STC840.20:Alpine:alp7fp1:Ubuntu 16.04, BlueFin (SAN) EEH 6 times during boot then disabled SRC BA188002:b0314a_1612.840

  Problem Description :
  Booting Ubuntu 16.04 with Bluefin (SAN) and several other adapters: the Bluefin hit EEH 6 times and was then disabled, with SRC BA188002 reported. None of the other adapters had any issues.

  ===================================END===============================
  ==== State: Open by: mlfield on 21 March 2016 14:47:26 ====

  Attached Dmesg Log: dmesg1.txt

  mlfield (mlfield@xxxxxxxxxx) added native attachment
  /opt/IBM/WebSphere/AppServer/profiles/cqweb/temp/ausratsrv5Node01/server1/TeamEAR/cqweb.war/dmesg1.txt
  on 2016-03-21 14:47:26

  == Comment: #2 - Application Cdeadmin <cdeadmin@xxxxxxxxxx> -
  2016-03-21 15:55:16 ==

  
  == Comment: #12 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-04-04 14:09:48 ==
  Info from Mike on ST.
  Assigned the adapter in the drawer to the LPAR, it hit the problem just like the adapter in the CEC.
  This points to a kernel/driver problem, since 14.04 didn't hit the problem.

  
  mlfield@xxxxxxxxxx - Michael Littlefield/Austin/Contr/IBM: just added both Bluefins; it happens with both, so MEX and CEC.
  # Slot                   Description                                               Device(s)
  U78C7.001.1AQH383-P1-C4  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                     fibre-channel
  U78C7.001.1AQH383-P1-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0000:60:00.1
                                                                                     0000:60:00.0
  U78CD.001.FZH0132-P1-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  fibre-channel
                                                                                     fibre-channel
  U78CD.001.FZH0132-P2-C1  PCI-E capable, Rev 3, 16x lanes with 16x lanes connected  0002:50:00.0
  U78CD.001.FZH0132-P2-C3  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0003:70:00.0
  U78CD.001.FZH0132-P2-C6  PCI-E capable, Rev 3, 8x lanes with 8x lanes connected    0004:a0:00.5
                                                                                     0004:a0:00.4
                                                                                     0004:a0:00.3
                                                                                     0004:a0:00.2
                                                                                     0004:a0:00.1
                                                                                     0004:a0:00.0

  == Comment: #16 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-04-12 18:00:26 ==
  Mike provided the LPAR for debugging earlier today.

  Observations.
  1) The NUMA node configuration is odd -- likely an effect of DLPAR of Memory/CPU.
  - node 0: has CPUs but no memory
  - node 2: has CPUs and memory
  - node 6: has memory but no CPUs

  (0) root @ alp7p04: /root
  # numactl -H
  available: 3 nodes (0,2,6)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  node 2 size: 34216 MB
  node 2 free: 33248 MB
  node 6 cpus:
  node 6 size: 6644 MB
  node 6 free: 6568 MB
  node distances:
  node   0   2   6 
    0:  10  40  40 
    2:  40  10  40 
    6:  40  40  10 
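
The same check can be done mechanically. The following is only a sketch, assuming the `numactl -H` output format shown above; the function name unbalanced_nodes is made up for illustration:

```python
import re


def unbalanced_nodes(numactl_output: str) -> dict:
    """Map node id -> description for nodes lacking either CPUs or memory."""
    cpus, size = {}, {}
    for line in numactl_output.splitlines():
        m = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if m:
            cpus[int(m.group(1))] = m.group(2).split()
            continue
        m = re.match(r"\s*node (\d+) size: (\d+) MB", line)
        if m:
            size[int(m.group(1))] = int(m.group(2))

    problems = {}
    for node in sorted(cpus.keys() | size.keys()):
        has_cpu = bool(cpus.get(node))
        has_mem = size.get(node, 0) > 0
        if has_cpu and not has_mem:
            problems[node] = "CPUs but no memory"
        elif has_mem and not has_cpu:
            problems[node] = "memory but no CPUs"
    return problems
```

Fed the output above, this flags nodes 0 and 6 and leaves node 2 alone.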

  
  2) The problem does not reproduce with 14.04 kernel (4.2 from wily).

  Comparing the dmesg logs up to the point of failure, there are differences in the NUMA setup code.
  2a) A small offset difference in the NUMA DATA starting address. For example:

  16.04: [    0.000000] numa:   NODE_DATA [mem 0x9ffe46100-0x9ffe4ffff]

  14.04: [    0.000000] numa:   NODE_DATA [mem 0x9ffe45000-0x9ffe4ffff]

  2b) A *totally* different end address in the "Initmem setup node 0"

  16.04: [    0.000000] Initmem setup node 0 [mem
  0x0000000000000000-0x0000000000000000]

  14.04: [    0.000000] Initmem setup node 0 [mem
  0x0000000000000000-0xffffffffffffffff]
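
To compare these ranges across several dmesg logs, a small parser helps. This is a sketch only, assuming the "Initmem setup node" line format shown above (including lines wrapped by the mail client); the name initmem_ranges is hypothetical:

```python
import re

# \s+ after "[mem" tolerates the line wrap introduced in the quoted logs.
INITMEM = re.compile(
    r"Initmem setup node (\d+) \[mem\s+0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\]"
)


def initmem_ranges(dmesg_text: str) -> dict:
    """Map node id -> (start, end) parsed from 'Initmem setup node' lines."""
    return {
        int(node): (int(start, 16), int(end, 16))
        for node, start, end in INITMEM.findall(dmesg_text)
    }
```

A node whose range is (0, 0), as in the 16.04 line above, can then be picked out with a plain comparison.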


  In progress.
  I'll go through the NUMA setup code.

  == Comment: #20 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-04-12 18:18:52 ==
  Booting the 16.04 kernel with the numa=off boot option.
  The EEH errors still happen, but much later (e.g., the 6th error/permanent failure happens only after the login prompt).

  == Comment: #22 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-04-13 10:23:33 ==
  (In reply to comment #16)
  > 2b) A *totally* different end address in the "Initmem setup node 0" 
  > 
  > 16.04: [    0.000000] Initmem setup node 0 [mem
  > 0x0000000000000000-0x0000000000000000]
  > 
  > 14.04: [    0.000000] Initmem setup node 0 [mem
  > 0x0000000000000000-0xffffffffffffffff]

  And this is the value on the original/reported dmesg attachment (on
  different NUMA node configuration, before some memory and CPUs were
  moved from this LPAR to another one):

  [Mon Mar 21 09:07:45 2016] Initmem setup node 0 [mem
  0x0000000000000000-0x00000078cfffffff]

  Notice it's non-zero, as on 14.04, so I'm not sure the NUMA
  differences are directly related to this bug.

  == Comment: #27 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-05-18 19:47:05 ==
  Assigning this bug to Guilherme per EEH debugging experience and contacts.

  From what we've discussed, this problem doesn't seem to be specific to the lpfc device driver. 
  This same adapter/driver works fine on other systems (it has passed our FVT Regression testing w/out this problem).
  So we suspect that some change in the EEH or machine/platform-dependent code is causing this, given that the 14.04 HWE kernel doesn't show this issue on this same LPAR.

  == Comment: #30 - Guilherme Guaglianoni Piccoli <gpiccoli@xxxxxxxxxx> - 2016-05-25 16:35:50 ==
  Quick update on this one: I have been investigating since Monday, and what I found is that in those cases of spontaneous EEH, the PCI BARs of the device are filled with 0xFF, indicating some kind of corruption in the adapter's memory.

  To dump the PCI BARs, I first booted without EEH (by using eeh=off).
  The problem reproduces on upstream kernel v4.5, but not on v4.4, so
  it seems to be a regression.

  I'm studying the commits between those revisions, making bisects,
  etc...so we can find which commits introduced this behavior.
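
For readers unfamiliar with bisecting, the idea is a binary search over the commit history. The sketch below is illustrative only and is not the procedure actually used here: first_bad and is_bad are hypothetical names, and is_bad stands in for "build this revision, boot it, and check for the EEH".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad is True.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the first bad commit is mid or earlier
        else:
            lo = mid + 1    # the first bad commit is after mid
    return commits[lo]
```

Each test halves the remaining range, so the number of kernel builds and boots grows only with the logarithm of the number of commits between the good and bad revisions.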

  Thanks,

  
  Guilherme

  == Comment: #31 - Guilherme Guaglianoni Piccoli <gpiccoli@xxxxxxxxxx> - 2016-05-27 18:59:09 ==
  Offending commit was found after doing some bisect and analysis on upstream kernel:

  d6de08cc462 ("lpfc: Fix the FLOGI discovery logic to comply with T11
  standards")

  When this commit was reverted in kernel 4.6, the problem disappeared.
  I do see some FLOGI failures in dmesg, but I guess this is somewhat normal (reference: https://access.redhat.com/solutions/400483).

  
  Now, the next step is to investigate what's going on with this commit; it should have been tested before it was merged, so this could be an unexpected corner case we're experiencing. I guess Maurício's opinion would be really useful here, since he has a lot of expertise in Fibre Channel devices (he should be back at the beginning of next week).

  
  One more thought: it's important to determine the real priority of this bug. If this is a stop ship, or the impact on some release would be critical, we could ask Canonical to revert the commit until a proper fix is implemented. I guess Brian, Mauricio and Breno's opinions on this are valuable.

  Thanks,

  
  Guilherme

  == Comment: #32 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-05-30 10:13:57 ==
  Guilherme,

  Thank you very much for the precise handling on this one. Reassigning
  it back to myself.

  I wouldn't have imagined this was a driver-specific problem, but given your
  pointer to this commit, it's indeed something in that direction -- the
  dmesg log confirms there's some involvement of the FLOGI (fabric login)
  steps (related to the mentioned commit).

  The devices have 2 ports (eg, PCI functions 0 and 1). 
  - Function 0 is processed first -- probe finishes OK, and it starts FLOGI steps. 
  - Function 1 starts probe during Function 0's FLOGI steps -- and Function 1's probe fails with the EEH.

  So, the change in the FLOGI logic seems to be closely involved in the
  problems seen by the mailbox commands that result in the EEH.

  More on this later.

  [    1.215858] lpfc 0001:01:00.0: enabling device (0144 -> 0146)
  ...
  [    2.143487] lpfc 0001:01:00.1: enabling device (0144 -> 0146)
  ...
  [    2.636592] lpfc 0001:01:00.0: 0:1303 Link Up Event x1 received Data: x1 x0 x80 x0 x0 x0 0
  [    2.638459] lpfc 0001:01:00.0: 0:(0):2858 FLOGI failure Status:x3/x103 TMO:x14 Data x1800 x0
  [    2.638464] lpfc 0001:01:00.0: 0:(0):0100 FLOGI failure Status:x3/x103 TMO:x14
  [    2.639019] EEH: Frozen PHB#1-PE#10000 detected
  ...
  [    2.639049] [c00000084f612ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
  [    2.639061] [c00000084f612f20] [d000000008ed3cc4] lpfc_sli4_wait_bmbx_ready+0x114/0x150 [lpfc]
  ...
  [    2.639086] [c00000084f6131c0] [d000000008ee7780] lpfc_cq_create+0x210/0x370 [lpfc]
  ...
  [    2.639113] [c00000084f613550] [d000000008f23a28] lpfc_pci_probe_one+0x1248/0x13d0 [lpfc]
  [    2.639117] [c00000084f6135f0] [c0000000005daefc] local_pci_probe+0x6c/0x140
  ...
  [    2.639158] lpfc 0001:01:00.1: 1:(0):2544 Mailbox command x9b (x1/xc) cannot issue Data: x200 x1
  ...
  [    2.639166] lpfc 0001:01:00.1: 1:2501 CQ_CREATE mailbox failed with status x0 add_status x0, mbx status xff
  ...

  == Comment: #33 - Guilherme Guaglianoni Piccoli <gpiccoli@xxxxxxxxxx> - 2016-05-30 12:56:21 ==
  Thanks Maurício!

  I noticed that, compiling the kernel both with the commit and without it (by
  reverting it), the following branch is taken in lpfc_mbox_dev_check():

  if (phba->link_state == LPFC_HBA_ERROR)

  So, in both cases the link_state is in the error state, but the commit perhaps introduced some reordering such that the driver can no longer handle this failure, maybe because of a race condition between threads.
  This conclusion came from the following snippet of commit message:

  "Required reworking the call sequence in the discovery threads."

  
  Thanks for taking it from here.
  Cheers,

  
  Guilherme

  == Comment: #34 - Breno Henrique Leitao <brenohl@xxxxxxxxxx> - 2016-05-30 13:25:00 ==
  > we could ask Canonical to revert it until a proper fix be
  > implemented. Guess Brian, Mauricio and Breno's opinion on this are valuable.

  Well, it will not be simple to ask them to revert it. Although we
  requested the lpfc package upgrade [via bug #132388], there was
  another request to do so (LP: #1541592), so I would suggest trying to
  propose a fix rather than asking to revert this commit.

  Does it make sense?

  == Comment: #35 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-05-30 14:17:25 ==
  It seems this commit might fix the problem. I'm working on a build with it.

  ae09c765109293b600ba9169aa3d632e1ac1a843
  lpfc: Fix DMA faults observed upon plugging loopback connector

  Driver didn't program the REG_VFI mailbox correctly, giving the adapter
  bad addresses.

  == Comment: #36 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> - 2016-05-30 17:35:30 ==
  Hi Canonical,

  Can you please apply this fix for the lpfc driver?

  This upstream commit fixes the problem:

  	ae09c765109293b600ba9169aa3d632e1ac1a843
  	lpfc: Fix DMA faults observed upon plugging loopback connector

  Original kernel (4.4.0-22.40)

  	root@alp7p04:~# uname -a
  	Linux alp7p04 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:35 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

  	root@alp7p04:~# dmesg | grep -i eeh
  	[    0.051252] EEH: pSeries platform initialized
  	[    0.137050] EEH: devices created
  	[    0.167121] EEH: PCI Enhanced I/O Error Handling Enabled
  	[    3.039195] EEH: Frozen PHB#3-PE#10000 detected
  	[    3.039211] EEH: PE location: N/A, PHB location: N/A
  	[    3.039234] [c00000062fa16e40] [c0000000000379b4] eeh_dev_check_failure+0x534/0x580
  	[    3.039237] [c00000062fa16ee0] [c000000000037a84] eeh_check_failure+0x84/0xd0
  	[    3.039398] EEH: Detected PCI bus error on PHB#3-PE#10000
  	<...>

  Patched kernel (4.4.0-22.40 + patch)

  	root@alp7p04:~# uname -a
  	Linux alp7p04 4.4.0-22-generic #40+bz139414c35 SMP Mon May 30 10:54:04 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

  	root@alp7p04:~# dmesg | grep -i eeh
  	[    0.051222] EEH: pSeries platform initialized
  	[    0.137348] EEH: devices created
  	[    0.167359] EEH: PCI Enhanced I/O Error Handling Enabled
  	root@alp7p04:~#
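
The before/after comparison above can be automated when checking many boots. A minimal sketch, assuming the "EEH: Frozen PHB#...-PE#... detected" message format seen in these logs (the names FROZEN and frozen_pes are hypothetical):

```python
import re

FROZEN = re.compile(r"EEH: Frozen PHB#([0-9a-fA-F]+)-PE#([0-9a-fA-F]+) detected")


def frozen_pes(dmesg_text: str) -> list:
    """Return (phb, pe) string pairs for every 'EEH: Frozen' line, in order."""
    return [m.groups() for m in FROZEN.finditer(dmesg_text)]
```

An empty list for the patched kernel and a non-empty one for the original kernel corresponds to the two dmesg excerpts above.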

  == Comment: #38 - Mauricio Faria De Oliveira <mauricfo@xxxxxxxxxx> -
  2016-05-30 17:42:13 ==

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1587316/+subscriptions