
kernel-packages team mailing list archive

[Bug 1508609] Re: [Hyper-V] Race condition in SMP bootup

 

** Changed in: linux-lts-utopic (Ubuntu)
       Status: In Progress => Fix Released

** Changed in: linux-lts-utopic (Ubuntu Trusty)
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1508609

Title:
  [Hyper-V] Race condition in SMP bootup

Status in linux package in Ubuntu:
  Fix Released
Status in linux-lts-utopic package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Invalid
Status in linux-lts-utopic source package in Trusty:
  Fix Released
Status in linux source package in Vivid:
  Fix Released
Status in linux source package in Wily:
  Fix Released

Bug description:
  Please integrate the following upstream commit.

  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dd9d3843755da95f63dd3a376f62b3e45c011210

  sched: Fix cpu_active_mask/cpu_online_mask race
  There is a race condition in SMP bootup code, which may result in

      WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
      workqueue_cpu_up_callback()
  or
      kernel BUG at kernel/smpboot.c:135!

  It can be triggered with a bit of luck in Linux guests running
  on busy hosts.

  	CPU0                        CPUn
  	====                        ====

  	_cpu_up()
  	  __cpu_up()
  				    start_secondary()
  				      set_cpu_online()
  					cpumask_set_cpu(cpu,
  						   to_cpumask(cpu_online_bits));
  	  cpu_notify(CPU_ONLINE)
  	    <do stuff, see below>
  					cpumask_set_cpu(cpu,
  						   to_cpumask(cpu_active_bits));

  During the various CPU_ONLINE callbacks CPUn is online but not
  active. Several things can go wrong at that point, depending on
  the scheduling of tasks on CPU0.
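
The window described above can be modelled in a few lines of C. This is only an illustrative sketch, not kernel code: the struct and every model_* name are hypothetical, and each mask is reduced to a single machine word in which bit n stands for CPU n.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: bit n of each word stands for CPU n. */
struct model_masks {
    unsigned long online;
    unsigned long active;
};

/* Return the mask bit for a CPU, guarding against out-of-range numbers. */
static unsigned long model_bit(unsigned int cpu)
{
    assert(cpu < 8 * sizeof(unsigned long));
    return 1UL << cpu;
}

static void model_set_online(struct model_masks *m, unsigned int cpu)
{
    m->online |= model_bit(cpu);
}

static void model_set_active(struct model_masks *m, unsigned int cpu)
{
    m->active |= model_bit(cpu);
}

/* True while the CPU is visible as online but not yet as active. */
static bool model_in_window(const struct model_masks *m, unsigned int cpu)
{
    unsigned long bit = model_bit(cpu);

    return (m->online & bit) != 0 && (m->active & bit) == 0;
}
```

In this model, model_in_window() is true exactly between the two cpumask_set_cpu() steps in the diagram, which is when the CPU_ONLINE callbacks on CPU0 may run.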

  Variant 1:

    cpu_notify(CPU_ONLINE)
      workqueue_cpu_up_callback()
        rebind_workers()
          set_cpus_allowed_ptr()

    This call fails because it requires an active CPU; rebind_workers()
    ends with a warning:

      WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
      workqueue_cpu_up_callback()

  Variant 2:

    cpu_notify(CPU_ONLINE)
      smpboot_thread_call()
        smpboot_unpark_threads()
         ..
          __kthread_unpark()
            __kthread_bind()
            wake_up_state()
             ..
              select_task_rq()
                select_fallback_rq()

    The ->wake_cpu of the unparked thread is not allowed, making a call
    to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
    find an allowed, active CPU and promptly resets the allowed CPUs, so
    that the task in question ends up on CPU0.

    When those unparked tasks are eventually executed, they run
    immediately into a BUG:

      kernel BUG at kernel/smpboot.c:135!
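
The fallback behaviour in Variant 2 can be sketched as follows. This is a deliberately simplified model of the behaviour described above, not the kernel's select_fallback_rq(): the fallback_* names are hypothetical, masks are single words with bit n standing for CPU n, and the lowest matching CPU is chosen arbitrarily.

```c
#include <assert.h>

/* Return the lowest CPU number set in mask, or -1 if the mask is empty. */
static int fallback_lowest_cpu(unsigned long mask)
{
    for (int cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++) {
        if (mask & (1UL << cpu))
            return cpu;
    }
    return -1;
}

/*
 * Pick a CPU that is both allowed and active. If there is none, reset the
 * allowed mask to all possible CPUs and try once more (model assumption).
 */
static int fallback_pick_cpu(unsigned long *allowed, unsigned long active,
                             unsigned long possible)
{
    assert(allowed);

    int cpu = fallback_lowest_cpu(*allowed & active);
    if (cpu >= 0)
        return cpu;

    *allowed = possible;    /* the per-CPU binding is lost here */
    return fallback_lowest_cpu(*allowed & active);
}
```

With a thread allowed only on CPU 1 while only CPU 0 is active, the model widens the allowed mask and returns CPU 0, which matches the outcome described above: a per-CPU thread ends up running on the wrong CPU.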

  Just changing the order in which the online/active bits are set
  (and adding some memory barriers) would solve the two issues
  above. However, it would change the order of operations back to
  the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
  set_cpus_allowed_ptr()"), thus reintroducing that particular
  problem.

  Going further back into history, we have at least the following
  commits touching this topic:
  - commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
  - commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

  Together, these rule out the following approaches:

    - secondary CPU sets active before online: fails because active is
      assumed to be a subset of online;

    - secondary CPU sets online before active: fails because the primary
      CPU assumes that an online CPU is also active;

    - secondary CPU sets online and waits for the primary CPU to set
      active: fails because it might deadlock.

  Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
  active & online") introduces an arch-specific solution to this
  arch-independent problem.

  Now, go for a more general solution without explicit waiting and
  simply set active twice: once on the secondary CPU after online
  was set and once on the primary CPU after online was seen.
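
The "set active twice" idea can be illustrated with a small model. Again this is a sketch under stated assumptions rather than the upstream patch: the fix_* names are hypothetical, CPUn is taken to be CPU 1, and the concurrent execution of the two CPUs is flattened into one of two fixed interleavings selected by a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: bit n of each word stands for CPU n. */
struct fix_masks {
    unsigned long online;
    unsigned long active;
};

/*
 * Model one bring-up of CPU 1 and report whether it is active at the
 * point where the CPU_ONLINE callbacks would run on the primary CPU.
 *
 * secondary_set_active_first: the secondary CPU won the race and set
 *                             its active bit before the callbacks.
 * primary_also_sets_active:   the primary CPU sets the active bit itself
 *                             once it has seen the online bit.
 */
static bool fix_active_at_callbacks(bool secondary_set_active_first,
                                    bool primary_also_sets_active)
{
    struct fix_masks m = {0, 0};
    const unsigned long bit = 1UL << 1;

    m.online |= bit;                /* secondary: mark online */
    if (secondary_set_active_first)
        m.active |= bit;            /* secondary: mark active */

    assert(m.online & bit);         /* primary: online was seen */
    if (primary_also_sets_active)
        m.active |= bit;            /* primary: mark active again */

    return (m.active & bit) != 0;
}
```

Setting a bit that is already set changes nothing, so the second write is harmless when the secondary CPU has already marked itself active, and it closes the window when it has not. No waiting is involved in either case.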


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1508609/+subscriptions

