kernel-packages team mailing list archive

[Bug 1491494] Re: Ubuntu 14.04.03 LPAR hits kernel oops after serial adapter is removed from profile

 

------- Comment From cdeadmin@xxxxxxxxxx 2015-10-26 18:25 EDT-------
==== State: Assigned by: nguyenp on 26 October 2015 13:14:50 ====

Per a Sametime chat with Gabriel this morning, he is working on building a
workaround for the problem.

I'm lowering the severity of the defect since it's not a blocker.

** Tags removed: severity-critical
** Tags added: severity-high

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1491494

Title:
  Ubuntu 14.04.03 LPAR hits kernel oops after serial adapter is removed
  from profile

Status in linux package in Ubuntu:
  Triaged

Bug description:
  -- Problem Description --

  The failure is related to the BELL-3 (2-port Async EIA-232) adapter.
  Ubuntu always hits an exception when the adapter is not present. See my
  test scenarios below.

  Test #1: Boot Ubuntu with BELL-3 adapter 
  =======

  - The Ubuntu LPAR had previously been running with the BELL-3 (2-port Async EIA-232) adapter, so I assigned the BELL-3 adapter to the Ubuntu LPAR profile and powered on the LPAR.
  => Ubuntu booted fine this time.

   
  Test #2: Boot Ubuntu with BELL-3 adapter removed from LPAR profile
  =======

  - I powered down the Ubuntu partition, removed the BELL-3 adapter from the LPAR profile, then powered on the LPAR.
  => Ubuntu hit the exception.

  Elapsed time since release of system processors: 0 mins 9 secs
  error: no suitable video mode found.
  OF stdout device is: /vdevice/vty@30000000
  Preparing to boot Linux version 3.19.0-23-generic (buildd@denneed03) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #24~14.04.1-Ubuntu SMP Wed Jul 8 11:17:19 UTC 2015 (Ubuntu 3.19.0-23.24~14.04.1-generic 3.19.8-ckt2)
  Detected machine type: 0000000000000101
  Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
  Calling ibm,client-architecture-support... done
  command line: BOOT_IMAGE=/boot/vmlinux-3.19.0-23-generic root=UUID=768190e7-f633-4c63-a1e3-588d12dea265 ro quiet splash vt.handoff=7
  memory layout at init:
    memory_limit : 0000000000000000 (16 MB aligned)
    alloc_bottom : 000000000b420000
    alloc_top    : 0000000010000000
    alloc_top_hi : 0000000010000000
    rmo_top      : 0000000010000000
    ram_top      : 0000000010000000
  instantiating rtas at 0x000000000ecb0000... done
  prom_hold_cpus: skipped
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x000000000b430000 -> 0x000000000b4316b1
  Device tree struct  0x000000000b440000 -> 0x000000000b470000
  Calling quiesce...
  returning from prom_init
   -> smp_release_cpus()
  spinning_secondaries = 15
   <- smp_release_cpus()
   <- setup_system()
  [    0.661510] /build/linux-lts-vivid-uV14Ja/linux-lts-vivid-3.19.0/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
  [    0.672826] sd 0:0:1:0: [sda] Assuming drive cache: write through
  [    4.658302] device-mapper: table: 252:0: multipath: error getting device
  [    4.691990] device-mapper: table: 252:0: multipath: error getting device
  [    4.934034] device-mapper: table: 252:0: multipath: error getting device
  [    4.951977] device-mapper: table: 252:0: multipath: error getting device
   * Discovering and coalescing multipaths...                              [ OK ] 
  Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
   * Starting AppArmor profiles                                            [ OK ] 
  Loading the saved-state of the serial devices... 
  [    5.109665] Unable to handle kernel paging request for data at address 0xd000080000000003
  [    5.109677] Faulting instruction address: 0xc00000000060fec4
  [    5.109685] Oops: Kernel access of bad area, sig: 11 [#1]
  [    5.109691] SMP NR_CPUS=2048 NUMA pSeries
  [    5.109699] Modules linked in: dm_round_robin dm_multipath scsi_dh pseries_rng rtc_generic knem(OE) nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) mlx4_en(OE) mlx4_core(OE) mlx_compat(OE)
  [    5.109759] CPU: 1 PID: 1816 Comm: setserial Tainted: G           OE  3.19.0-23-generic #24~14.04.1-Ubuntu
  [    5.109769] task: c0000000f389c880 ti: c0000000f0528000 task.ti: c0000000f0528000
  [    5.109777] NIP: c00000000060fec4 LR: c000000000617498 CTR: c00000000060fe20
  [    5.109785] REGS: c0000000f052b6b0 TRAP: 0300   Tainted: G           OE   (3.19.0-23-generic)
  [    5.109793] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 84002022  XER: 00000000
  [    5.109814] CFAR: c000000000008468 DAR: d000080000000003 DSISR: 42000000 SOFTE: 1 
  GPR00: c000000000617498 c0000000f052b930 c00000000144c700 00000000000000bf 
  GPR04: d000080000000003 00000000000000bf c0000000f3990000 0000000000000141 
  GPR08: c000000000611d20 c0000000013539e0 d000080000000000 c000000001351ba8 
  GPR12: c00000000060fe20 c00000000e830900 0000000000000000 0000000000000000 
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
  GPR20: 000000000000007d 0000000000000040 0000000000000000 0000000000000000 
  GPR24: 0000000000000000 c0000000f53cbc00 0000000000000001 0000000000000000 
  GPR28: c0000000f53cbde0 00000000000000bf 0000000000000003 c000000001754970 
  [    5.109916] NIP [c00000000060fec4] io_serial_out+0xa4/0xd0
  [    5.109924] LR [c000000000617498] serial8250_do_startup+0x978/0xe50
  [    5.109931] Call Trace:
  [    5.109936] [c0000000f052b930] [c0000000f052b970] 0xc0000000f052b970 (unreliable)
  [    5.109948] [c0000000f052b970] [c000000000617498] serial8250_do_startup+0x978/0xe50
  [    5.109958] [c0000000f052ba10] [c00000000060eb00] uart_startup.part.7+0xd0/0x310
  [    5.109967] [c0000000f052ba60] [c00000000060f1ac] uart_set_info+0x46c/0x580
  [    5.109976] [c0000000f052bb90] [c00000000060f378] uart_ioctl+0xb8/0x590
  [    5.109986] [c0000000f052bc40] [c0000000005dd89c] tty_ioctl+0x21c/0xf60
  [    5.109995] [c0000000f052bd40] [c0000000002ce680] do_vfs_ioctl+0x4f0/0x7c0
  [    5.110004] [c0000000f052bde0] [c0000000002cea24] SyS_ioctl+0xd4/0xf0
  [    5.110014] [c0000000f052be30] [c000000000009258] system_call+0x38/0xd0
  [    5.110021] Instruction dump:
  [    5.110026] 38210040 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 3d42fff0 
  [    5.110040] 392a72e0 e9490000 7c845214 7c0004ac <98640000> 39200001 992d02bc 38210040 
  [    5.110057] ---[ end trace 7c597ccc52ffb926 ]---
  [    5.114039] 

  Test #3: DLPAR-remove the adapter first, then reboot the LPAR
  =======
  - I powered down the Ubuntu LPAR.
  - I then assigned the BELL-3 adapter back to the Ubuntu LPAR profile and powered on the partition.
  - It booted fine with no problems.

  root@tul7p07:~# lspci
  60:00.0 Serial controller: Digi International Device 00f6

    0000:60:00.0 ttyS0 ttyS1 serial U78CB.001.WZS02NH-P1-C12-T1
                                           serial (1410f600)
          Manufacturer Name.........IBM
          Machine Type-Model........Unknown
          Device Specific.(YC)......0
          Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

    ttyS0            U78CB.001.WZS02NH-P1-C12-T1
                                           Serial Device
          Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

    ttyS1            U78CB.001.WZS02NH-P1-C12-T1
                                           Serial Device 
          Location Code.(YL)........U78CB.001.WZS02NH-P1-C12-T1

  - I then went into the HMC and performed a DLPAR remove of the adapter
  this time. The operation completed successfully.

  - I then powered down the LPAR and checked the LPAR profile (the BELL-3
  adapter was no longer assigned).

  - I then powered up the Ubuntu LPAR again. It still hit the exception in
  this case.

  So Ubuntu always hits the exception when the adapter is not present.

  
  The system does show a config file originally created on Jul 30. The /etc/init.d/setserial is the startup service that attempts to configure the serial devices either using /etc/serial.conf (there isn't one) or /var/lib/setserial/autoserial.conf which does exist.

  root@tul7p07:/etc/init.d# ls -l /var/lib/setserial/autoserial.conf
  -rw-r--r-- 1 root root 518 Jul 30 00:27 /var/lib/setserial/autoserial.conf
  root@tul7p07:/etc/init.d# ls /etc/serial.conf
  ls: cannot access /etc/serial.conf: No such file or directory
  root@tul7p07:/etc/init.d# cat /var/lib/setserial/autoserial.conf
  ###PORT STATE GENERATED USING AUTOSAVE-ONCE###
  ###AUTOSAVE-ONCE###
  ###AUTOSAVE-ONCE###
  ###AUTOSAVE###
  #
  # If you want to configure this file by hand, use 
  # dpkg-reconfigure setserial
  # and change the configuration mode of the file to MANUAL. If you do not do this, this file may be overwritten automatically the next time you upgrade the
  # package.
  #
  /dev/ttyS0 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test
  /dev/ttyS1 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test

  I am thinking that if you rename or move
  /var/lib/setserial/autoserial.conf so that setserial cannot find it
  (disabling the setserial service might work, too), the system may
  just come up without the adapter.

  So, the next step is to rename or move that conf file, shut down the
  partition, remove the Digi adapter from the profile and see what
  happens when we come back up. If it comes back up, the question will
  be: what should the OS do if it has autosaved configuration info on
  the ports and then the adapter is removed? Should the system ensure
  those devices are still present before attempting to tell the kernel
  to configure them; should the kernel have more sanity checks?
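To make the question concrete, here is a minimal Python sketch that lists what the saved state asks setserial to configure. It is only an illustration: `parse_autoserial` is a hypothetical helper, not part of the setserial package, and it assumes the simple "device keyword value ..." layout seen in the autoserial.conf shown above.

```python
# Hypothetical helper: parse the saved port state from autoserial.conf text.
# Assumes the layout seen in this report: "<device> <keyword> <value> ... <flags>".

KEYS_WITH_VALUE = ("uart", "port", "irq", "baud_base")


def parse_autoserial(text):
    """Return a list of (device, settings) tuples, skipping comments and blanks."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # "###AUTOSAVE###" markers and comments are ignored
        fields = line.split()
        device, rest = fields[0], fields[1:]
        settings = {"flags": []}
        i = 0
        while i < len(rest):
            word = rest[i]
            if word in KEYS_WITH_VALUE and i + 1 < len(rest):
                settings[word] = rest[i + 1]
                i += 2
            else:
                settings["flags"].append(word)  # e.g. spd_normal, skip_test
                i += 1
        entries.append((device, settings))
    return entries


SAMPLE = """###AUTOSAVE###
#
/dev/ttyS0 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test
/dev/ttyS1 uart 16950/954 port 0x0000 irq 0 baud_base 4000000 spd_normal skip_test
"""

for device, settings in parse_autoserial(SAMPLE):
    print(device, settings["uart"], settings["port"], settings["irq"])
```

Both saved entries carry `port 0x0000` and `skip_test`, so nothing in the file itself says whether the adapter is still there.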

  Thanks to Luciano C. for pointing out the issues. I ran tests and
  confirmed that what he pointed out is correct.

  So now we need to address these questions from his previous comment:

  - What should the OS do if it has autosaved configuration info on the ports and then the adapter is removed?
  - Should the system ensure those devices are still present before attempting to tell the kernel to configure them?
  - Should the kernel have more sanity checks?
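As an illustration of the second question, a userspace check could look for a PCI serial controller before applying the saved state. The sketch below is an assumption-laden example, not the actual fix: the function names are hypothetical, and it only covers PCI-attached adapters such as the one in this report (PCI class 0x0700, which lspci reports as "Serial controller"). A system with non-PCI serial ports would need a different test.

```python
import os

# PCI base class 0x07 (communication controller), subclass 0x00 (serial).
PCI_CLASS_SERIAL_PREFIX = "0x0700"


def pci_serial_controllers(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses whose class file marks a serial controller."""
    found = []
    if not os.path.isdir(sysfs_root):
        return found
    for addr in sorted(os.listdir(sysfs_root)):
        class_path = os.path.join(sysfs_root, addr, "class")
        try:
            with open(class_path) as handle:
                pci_class = handle.read().strip().lower()
        except OSError:
            continue  # entry without a readable class file
        if pci_class.startswith(PCI_CLASS_SERIAL_PREFIX):
            found.append(addr)
    return found


def should_apply_saved_state(sysfs_root="/sys/bus/pci/devices"):
    """Hypothetical policy: only replay autoserial.conf if an adapter exists."""
    return bool(pci_serial_controllers(sysfs_root))


if __name__ == "__main__":
    print("apply saved serial state:", should_apply_saved_state())
```

A check like this would only guard the setserial init script; it would not address the third question of whether the kernel should reject the request itself.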

  
  Here are the tests I ran:
  ========================

  1) First, I booted Ubuntu with serial adapter.

  root@tul7p07:~# lspci
  60:00.0 Serial controller: Digi International Device 00f6
  root@tul7p07:~# 

  2) Then I moved /var/lib/setserial/autoserial.conf to a different
  name (per Luciano C.'s instructions).

  root@tul7p07:~# mv /var/lib/setserial/autoserial.conf /var/lib/setserial/autoserial.conf.org
  root@tul7p07:~# ls -l /var/lib/setserial/autoserial.conf*
  -rw-r--r-- 1 root root 305 Jul 30 00:27 /var/lib/setserial/autoserial.conf.old
  -rw-r--r-- 1 root root 518 Jul 30 00:27 /var/lib/setserial/autoserial.conf.org

  
  3) I then shut down the Ubuntu partition and removed the serial adapter from the partition's profile.
  Then I booted it up again. The system came up to the login prompt.

  =====
  Ubuntu 14.04.3 LTS tul7p07.aus.stglabs.ibm.com hvc0

  tul7p07 login: root
  Password: 

  =================

  
  4) I then added the serial adapter back to the Ubuntu partition's profile and booted the partition up again.

  ====
  Ubuntu 14.04.3 LTS tul7p07.aus.stglabs.ibm.com hvc0

  tul7p07 login: root
  Password: 
  Last login: Wed Sep  2 09:49:21 CDT 2015 on hvc0
  Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.19.0-27-generic ppc64le)

   * Documentation:  https://help.ubuntu.com/
  root@tul7p07:~# lspci
  60:00.0 Serial controller: Digi International Device 00f6
  root@tul7p07:~# 
  ====

  5) I checked /var/lib/setserial/. A new autoserial.conf had been
  created, as expected.

  root@tul7p07:~# ls -l /var/lib/setserial/
  total 12
  -rw-r--r-- 1 root root  15 Sep  2 09:52 autoserial.conf
  -rw-r--r-- 1 root root  15 Sep  2 09:47 autoserial.conf.old
  -rw-r--r-- 1 root root 518 Jul 30 00:27 autoserial.conf.org
  -rw-r--r-- 1 root root   0 Jul 30 00:27 etc.serial.conf.bkp
  root@tul7p07:~# 

  6) I then shut down the Ubuntu partition without removing or renaming
  the autoserial.conf file.

  7) I removed the serial adapter from the Ubuntu partition's profile and
  booted the partition again. The kernel again tried to configure the
  serial port memory address, which is now a bogus address, so it hit
  the problem again.

  ==========
  Preparing to boot Linux version 3.19.0-27-generic (buildd@fisher04) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #29~14.04.1-Ubuntu SMP Sun Aug 16 01:51:48 UTC 2015 (Ubuntu 3.19.0-27.29~14.04.1-generic 3.19.8-ckt5)
  Detected machine type: 0000000000000101
  Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
  Calling ibm,client-architecture-support... done
  command line: BOOT_IMAGE=/boot/vmlinux-3.19.0-27-generic root=UUID=768190e7-f633-4c63-a1e3-588d12dea265 ro quiet splash vt.handoff=7
  memory layout at init:
    memory_limit : 0000000000000000 (16 MB aligned)
    alloc_bottom : 000000000b400000
    alloc_top    : 0000000010000000
    alloc_top_hi : 0000000010000000
    rmo_top      : 0000000010000000
    ram_top      : 0000000010000000
  instantiating rtas at 0x000000000ecb0000... done
  prom_hold_cpus: skipped
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x000000000b410000 -> 0x000000000b4116b1
  Device tree struct  0x000000000b420000 -> 0x000000000b450000
  Calling quiesce...
  returning from prom_init
   -> smp_release_cpus()
  spinning_secondaries = 15
   <- smp_release_cpus()
   <- setup_system()
  [    0.643938] /build/linux-lts-vivid-4KQgBt/linux-lts-vivid-3.19.0/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
  [    0.656156] sd 0:0:1:0: [sda] Assuming drive cache: write through
   * Discovering and coalescing multipaths...                              [ OK ] 
  Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
   * Starting AppArmor profiles                                            [ OK ] 
  Loading the saved-state of the serial devices... 
  [    5.276868] Unable to handle kernel paging request for data at address 0xd000080000000003
  [    5.276880] Faulting instruction address: 0xc00000000060f684
  [    5.276888] Oops: Kernel access of bad area, sig: 11 [#1]
  [    5.276894] SMP NR_CPUS=2048 NUMA pSeries
  [    5.276902] Modules linked in: dm_multipath scsi_dh pseries_rng ib_ipoib rdma_ucm rtc_generic rdma_cm iw_cm ib_ucm ib_uverbs ib_cm ib_umad mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache [last unloaded: mlx5_core]
  [    5.276960] CPU: 8 PID: 1466 Comm: setserial Not tainted 3.19.0-27-generic #29~14.04.1-Ubuntu
  [    5.276969] task: c0000000f2065300 ti: c0000000f21c4000 task.ti: c0000000f21c4000
  [    5.276977] NIP: c00000000060f684 LR: c000000000616c58 CTR: c00000000060f5e0
  [    5.276985] REGS: c0000000f21c76b0 TRAP: 0300   Not tainted  (3.19.0-27-generic)
  [    5.276992] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 84002022  XER: 00000000
  [    5.277012] CFAR: c000000000008468 DAR: d000080000000003 DSISR: 42000000 SOFTE: 1 
  GPR00: c000000000616c58 c0000000f21c7930 c00000000144cc00 00000000000000bf 
  GPR04: d000080000000003 00000000000000bf c0000000f54b0000 0000000000000141 
  GPR08: c0000000006114e0 c0000000013539e0 d000080000000000 c000000001351ba8 
  GPR12: c00000000060f5e0 c00000000e834800 0000000000000000 0000000000000000 
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
  GPR20: 000000000000007d 0000000000000040 0000000000000000 0000000000000000 
  GPR24: 0000000000000000 c0000000f8092000 0000000000000001 0000000000000000 
  GPR28: c0000000f80921e0 00000000000000bf 0000000000000003 c0000000017549f0 
  [    5.277115] NIP [c00000000060f684] io_serial_out+0xa4/0xd0
  [    5.277122] LR [c000000000616c58] serial8250_do_startup+0x978/0xe50
  [    5.277129] Call Trace:
  [    5.277134] [c0000000f21c7930] [c0000000f21c7970] 0xc0000000f21c7970 (unreliable)
  [    5.277145] [c0000000f21c7970] [c000000000616c58] serial8250_do_startup+0x978/0xe50
  [    5.277155] [c0000000f21c7a10] [c00000000060e2c0] uart_startup.part.7+0xd0/0x310
  [    5.277164] [c0000000f21c7a60] [c00000000060e96c] uart_set_info+0x46c/0x580
  [    5.277173] [c0000000f21c7b90] [c00000000060eb38] uart_ioctl+0xb8/0x590
  [    5.277183] [c0000000f21c7c40] [c0000000005dd01c] tty_ioctl+0x21c/0xf60
  [    5.277192] [c0000000f21c7d40] [c0000000002ce7a0] do_vfs_ioctl+0x4f0/0x7c0
  [    5.277201] [c0000000f21c7de0] [c0000000002ceb44] SyS_ioctl+0xd4/0xf0
  [    5.277210] [c0000000f21c7e30] [c000000000009258] system_call+0x38/0xd0
  [    5.277217] Instruction dump:
  [    5.277222] 38210040 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 3d42fff0 
  [    5.277236] 392a6de0 e9490000 7c845214 7c0004ac <98640000> 39200001 992d02bc 38210040 
  [    5.277252] ---[ end trace d5657031818c6b89 ]---
  [    5.280950] 
  [   11.843975] init: openibd pre-start process (1614) terminated with status 3

  =======================
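One plausible reading of the register dumps above, offered as a sketch rather than a confirmed diagnosis: the faulting address (DAR) equals the value in GPR10 plus the saved `port 0x0000` plus offset 3, which is the Line Control Register offset on 8250-class UARTs. The value being stored, 0xbf (GPR03), also matches a common LCR configuration-mode write. Treating GPR10 as the legacy I/O base is an assumption; the arithmetic itself is just:

```python
# Values taken from the oops output and autoserial.conf in this report.
IO_BASE = 0xD000080000000000   # GPR10 in both dumps; ASSUMED to be the legacy I/O base
SAVED_PORT = 0x0000            # "port 0x0000" from autoserial.conf
UART_LCR = 3                   # Line Control Register offset on 8250-class UARTs
REPORTED_DAR = 0xD000080000000003  # "data at address" in both oopses


def legacy_io_address(io_base, port, reg_offset):
    """Address an outb-style write would touch for a given port and register."""
    return io_base + port + reg_offset


addr = legacy_io_address(IO_BASE, SAVED_PORT, UART_LCR)
print(hex(addr))                # 0xd000080000000003
print(addr == REPORTED_DAR)     # True
```

If that reading is right, the stale saved state points the driver at legacy I/O port 0, which nothing backs once the adapter is gone.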

  Mirroring to Launchpad for Canonical folks to take a look...
