
kernel-packages team mailing list archive

[Bug 1276705] Re: Kernel 3.13 fail to boot with LSI SAS1068E (Dell SAS 6/iR)

 

That return statement is reached only when wait_for_completion_killable()
returned an error; that is, when the caller received SIGKILL while waiting
for kthreadd to create a kernel thread.

That matches your bisection result, because commit 786235ee changed
kthread_create() to return to the caller when the caller receives SIGKILL,
so that the OOM killer can kill a process that is waiting for kthreadd to
create a kernel thread.
The changelog which I expected for that commit is shown below.

----------
[PATCH] kthread: Make kthread_create() killable.

Any user process callers of wait_for_completion() except global init process
might be chosen by the OOM killer while waiting for completion() call by some
other process which does memory allocation.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
due to memory starvation because the OOM killer cannot kill such users.

kthread_create() is one of such users and this patch fixes the problem for
kthreadd by making kthread_create() killable.

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
----------
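
As a rough illustration of the behavior that changelog describes, here is a
minimal userspace sketch. All names are hypothetical and this is not the
kernel code. It assumes that a wait cut short by SIGKILL is reported to the
caller as -ENOMEM, which would be consistent with the "error = -12" in the
console log quoted in the bug description.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of waiting for the thread creator (kthreadd). */
enum wait_result { WAIT_COMPLETED, WAIT_KILLED };

/*
 * Toy model of a killable wait. If the creator already completed, the
 * wait succeeds. Otherwise a pending SIGKILL ends the wait early. With
 * neither, the model assumes the waiter sleeps until completion.
 */
enum wait_result model_wait_killable(bool completed, bool sigkill_pending)
{
    if (completed)
        return WAIT_COMPLETED;
    if (sigkill_pending)
        return WAIT_KILLED;
    return WAIT_COMPLETED;
}

/*
 * Toy model of a killable thread-creation request.
 * Assumption: an interrupted wait is surfaced as -ENOMEM.
 */
int model_thread_create(bool completed, bool sigkill_pending)
{
    if (model_wait_killable(completed, sigkill_pending) == WAIT_KILLED)
        return -ENOMEM;
    return 0;
}
```

In this model, model_thread_create(false, true) is the case seen in the bug:
the caller is killed before the thread exists, so the request fails even
though memory is not actually exhausted.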

I think there are two problems, listed below.

  (a) Somebody is sending SIGKILL to the caller of kthread_create().

        Is that somebody "systemd", after it waited for a timeout?
        Is the caller "PID: 9847 Comm: systemd-udevd"?

  (b) Error handling of the caller of kthread_create() is wrong.

        mptsas_probe() calls mptsas_remove() when scsi_host_alloc()
        returned NULL because the caller received SIGKILL.

        But mptsas_remove() assumes that "ioc->sh = sh;" was already called
        with sh != NULL which means scsi_host_alloc() did not return NULL.

        scsi_host_alloc() can return NULL when kzalloc() returned NULL.
        In other words, the caller of scsi_host_alloc() must be prepared for
        scsi_host_alloc() returning NULL even if the caller did not receive
        SIGKILL while waiting for kthreadd to create a kernel thread.
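
The error handling that (b) asks for can be sketched in userspace as follows.
The names fake_host, fake_ioc, fake_host_alloc(), fake_probe() and
fake_remove() are hypothetical stand-ins, not the real mptsas or SCSI code;
the sketch only shows a probe path that tolerates a NULL host and a remove
path that does not assume ioc->sh is set.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; NOT the real mptsas or SCSI structures. */
struct fake_host { int id; };
struct fake_ioc { struct fake_host *sh; };

/*
 * Stand-in for scsi_host_alloc(). Returns NULL when should_fail is
 * nonzero, modelling either a failed allocation or a SIGKILL received
 * while the error handler thread was being created.
 */
struct fake_host *fake_host_alloc(int should_fail)
{
    if (should_fail)
        return NULL;
    return calloc(1, sizeof(struct fake_host));
}

/* Teardown that checks ioc->sh instead of assuming it is non-NULL. */
void fake_remove(struct fake_ioc *ioc)
{
    if (!ioc->sh)
        return;
    free(ioc->sh);
    ioc->sh = NULL;
}

/*
 * Probe path: if the host allocation fails, report the error and return
 * before anything that depends on ioc->sh has been set up.
 */
int fake_probe(struct fake_ioc *ioc, int should_fail)
{
    struct fake_host *sh = fake_host_alloc(should_fail);

    if (!sh) {
        fprintf(stderr, "unable to register controller\n");
        return -1;
    }
    ioc->sh = sh;
    return 0;
}
```

With these definitions, fake_probe(&ioc, 1) returns -1 and leaves ioc.sh as
NULL, and a later fake_remove(&ioc) is then harmless rather than a NULL
pointer dereference.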

Therefore, I don't think reverting commit 786235ee is appropriate, because
the same problem will happen again whenever kzalloc() in scsi_host_alloc()
fails.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1276705

Title:
  Kernel 3.13 fail to boot with LSI SAS1068E (Dell SAS 6/iR)

Status in “linux” package in Ubuntu:
  Confirmed
Status in “linux” source package in Trusty:
  Confirmed

Bug description:
  We have recently upgraded a Dell R300 server to Trusty (it was running
  fine in precise), and after the upgrade it fails to boot.

  It is an issue with the SAS controller during initialisation. It
  fails to detect the disk, and we have the following error in the console log:

  [   36.539955] scsi4: error handler thread failed to spawn, error = -12
  [   36.552694] mptsas: ioc0: WARNING - Unable to register controller with SCSI subsystem

  After this error, the initramfs drops to a shell, complaining that the
  rootfs is not found. No disk is seen at all (cat /proc/partitions only
  shows sr0, the CD-ROM drive).

  We have this issue with two different servers (both R300, both with a
  Dell SAS 6/iR controller and the same hardware).

  We don't have this issue with another Dell server (R310, Dell PERC
  H200).

  We also tested with an old kernel (generic, 3.2.0-58.88), and it works.

  Those servers need a greater rootdelay (probably #579572), so we have
  rootdelay=45. If we remove rootdelay=45, then the disks are correctly
  recognized! (But a few seconds too late: the initramfs has already
  dropped to a shell. Pressing Control-D resumes the normal boot.)

  So the issue is that with the (mandatory) rootdelay greater than 30
  (the default value, I think), the disks are not detected due to the
  error shown above. This is a regression, since those servers worked in
  precise (and work with the old precise kernel).

  
  System information

  * Dell R300 with Dell SAS 6/iR controller
  * Ubuntu Trusty Tahr (14.04)
  * Running arch: x86_64
  * Kernel version: 3.13.0-7-generic  (dpkg version : 3.13.0-7.25)
  * Kernel command line: BOOT_IMAGE=/vmlinuz-3.13.0-7-generic root=UUID=174e14b5-46fc-479b-9f94-05cb33c75ac9 ro rootdelay=45 console=tty0 console=ttyS1,57600 quiet
  * uname -a: Linux frtls-perf01 3.13.0-7-generic #25-Ubuntu SMP Tue Feb 4 10:19:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  
  Attached files:

  * console output when the error occurs.
  * dmesg when the system boots (no rootdelay, need to press Control-D during initramfs boot)
  * lspci -vnn

  
  Tell me if you need more information.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1276705/+subscriptions

