group.of.nepali.translators team mailing list archive

[Bug 1765241] Re: virtio_scsi race can corrupt memory, panic kernel

To: group.of.nepali.translators@xxxxxxxxxxxxxxxxxxx
From: Stefan Bader <stefan.bader@xxxxxxxxxxxxx>
Date: Fri, 20 Apr 2018 08:33:16 -0000
Reply-to: Bug 1765241 <1765241@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Xenial)
       Status: New => In Progress

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1765241

Title:
  virtio_scsi race can corrupt memory, panic kernel

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  In Progress

Bug description:
          There's a race in the virtio_scsi driver (in all kernels checked).

          That race is inadvertently avoided on most kernels due to a
  synchronize_rcu call coincidentally placed in one of the racing code paths.

          By happenstance, the set of patches backported to the Ubuntu
  4.4 kernel ended up without a synchronize_rcu in the relevant place. The
  issue first manifests with:

  
  commit be2a20802abbde04ae09846406d7b670731d97d2
  Author: Jan Kara <jack@xxxxxxx>
  Date:   Wed Feb 8 08:05:56 2017 +0100

      block: Move bdi_unregister() to del_gendisk()
      
      BugLink: http://bugs.launchpad.net/bugs/1659111

  The race can cause a kernel panic due to corruption of a freelist
  pointer in a slab cache.  The panics occur in arbitrary places as
  the failure occurs at an allocation after the corruption of the
  pointer.  However, the most common failure observed has been within
  virtio_scsi itself during probe processing, e.g.:

  [    3.111628]  [<ffffffff811b0f32>] kfree_const+0x22/0x30
  [    3.112340]  [<ffffffff813db534>] kobject_release+0x94/0x190
  [    3.113126]  [<ffffffff813db3c7>] kobject_put+0x27/0x50
  [    3.113838]  [<ffffffff8153dee7>] put_device+0x17/0x20
  [    3.114568]  [<ffffffff815ac6b2>] __scsi_remove_device+0x92/0xe0
  [    3.115401]  [<ffffffff815a928b>] scsi_probe_and_add_lun+0x95b/0xe80
  [    3.116287]  [<ffffffff811f1083>] ? kmem_cache_alloc_trace+0x183/0x1f0
  [    3.117227]  [<ffffffff8154eb0b>] ? __pm_runtime_resume+0x5b/0x80
  [    3.118048]  [<ffffffff815a9eaa>] __scsi_scan_target+0x10a/0x690
  [    3.118879]  [<ffffffff815aa59e>] scsi_scan_channel+0x7e/0xa0
  [    3.119653]  [<ffffffff815aa743>] scsi_scan_host_selected+0xf3/0x160
  [    3.120506]  [<ffffffff815aa83d>] do_scsi_scan_host+0x8d/0x90
  [    3.121295]  [<ffffffff815aaa0c>] do_scan_async+0x1c/0x190
  [    3.122048]  [<ffffffff810a5748>] async_run_entry_fn+0x48/0x150
  [    3.122846]  [<ffffffff8109c6b5>] process_one_work+0x165/0x480
  [    3.123732]  [<ffffffff8109ca1b>] worker_thread+0x4b/0x4d0
  [    3.124508]  [<ffffffff8109c9d0>] ? process_one_work+0x480/0x480

  
  Details on the race:

  CPU A:

  virtscsi_probe
  [...]
  __scsi_scan_target
  scsi_probe_and_add_lun  [on return calls  __scsi_remove_device, below]
  scsi_probe_lun  
  [...]
  blk_execute_rq

          blk_execute_rq waits for the completion event.  In order for the
  race to occur, the woken task must resume on a CPU other than CPU B.

          After being woken up by the completion of the request, the call
  returns up the stack to scsi_probe_and_add_lun, which calls
  __scsi_remove_device:

  __scsi_remove_device
  blk_cleanup_queue
  [ no longer calls bdi_unregister ]
  scsi_target_reap(scsi_target(sdev))
  scsi_target_reap_ref_put
  kref_put
  kref_sub
  scsi_target_reap_ref_release
  scsi_target_destroy
  ->target_destroy() = virtscsi_target_destroy
          kfree(tgt)                                      <=== FREE TGT

          Note that the removal of the call to bdi_unregister in commit

    xenial be2a20802abbde block: Move bdi_unregister() to del_gendisk()

          and annotated above is the change that gates whether the issue
  manifests or not.  The other code change from be2a20802abbde has no effect
  on the race.

  CPU B:

  vring_interrupt
  virtscsi_complete_cmd
  scsi_mq_done (via ->scsi_done())
  blk_mq_complete_request
  __blk_mq_complete_request
  [...]
  blk_end_sync_rq
  complete
  [ wake up the task from CPU A ]

          After waking the CPU A task, execution returns up the stack, and
  then calls atomic_dec(&tgt->reqs) in virtscsi_complete_cmd immediately
  after returning from the call to ->scsi_done.

          If the activity on CPU A after it is woken up (starting at
  __scsi_remove_device) finishes before CPU B can call atomic_dec() in
  virtscsi_complete_cmd, then the atomic_dec() will modify a free list
  pointer in the freed slab object that contained tgt.  This causes the
  system to panic on a subsequent allocation from the per-cpu slab cache.
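
  The ordering problem can be replayed deterministically in userspace.
  The following is only a minimal sketch, not driver code: model_tgt,
  model_req and the cpu_a_*/cpu_b_* helpers are hypothetical stand-ins,
  and the free is modelled by setting a flag so the late decrement can be
  detected without real memory corruption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-target state that holds tgt->reqs. */
struct model_tgt {
	atomic_int reqs;  /* in-flight request count */
	bool freed;       /* set instead of actually freeing the memory */
};

/* Hypothetical stand-in for one synchronous request. */
struct model_req {
	struct model_tgt *tgt;
	bool done;        /* models the completion that wakes CPU A */
};

/* CPU B, first half: the ->scsi_done() path ends in complete(). */
static void cpu_b_signal_completion(struct model_req *req)
{
	req->done = true;
}

/*
 * CPU B, second half: the decrement that runs after ->scsi_done()
 * returns.  Returns true if it landed on an already freed target.
 */
static bool cpu_b_dec_reqs(struct model_req *req)
{
	bool touched_freed = req->tgt->freed;

	atomic_fetch_sub(&req->tgt->reqs, 1);
	return touched_freed;
}

/* CPU A: woken by the completion, runs through to the target free. */
static void cpu_a_remove_device(struct model_req *req)
{
	assert(req->done);        /* CPU A only runs after the wakeup */
	req->tgt->freed = true;   /* models kfree(tgt) */
}

/* Racy order: wakeup, free on CPU A, then the decrement on CPU B. */
static bool replay_bad_interleaving(void)
{
	struct model_tgt tgt;
	struct model_req req = { .tgt = &tgt, .done = false };

	atomic_init(&tgt.reqs, 1);
	tgt.freed = false;

	cpu_b_signal_completion(&req);
	cpu_a_remove_device(&req);
	return cpu_b_dec_reqs(&req);
}

/* Safe order: CPU B finishes its decrement before CPU A frees. */
static bool replay_safe_interleaving(void)
{
	struct model_tgt tgt;
	struct model_req req = { .tgt = &tgt, .done = false };
	bool touched_freed;

	atomic_init(&tgt.reqs, 1);
	tgt.freed = false;

	cpu_b_signal_completion(&req);
	touched_freed = cpu_b_dec_reqs(&req);
	cpu_a_remove_device(&req);
	return touched_freed;
}
```

  In this model replay_bad_interleaving() reports a decrement on freed
  state and replay_safe_interleaving() does not; the only difference is
  whether CPU A's free runs before or after CPU B's decrement.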

          The call path on CPU B is significantly shorter than that on CPU A
  after wakeup, so it is likely that an external event delays CPU B.  This
  could be either an interrupt within the VM or scheduling delays for the
  virtual cpu process on the hypervisor.  Whatever the delay is, it is not
  the root cause but merely the triggering event.

          The virtscsi race window described above exists in all kernels
  that have been checked (4.4 upstream LTS, Ubuntu 4.13 and 4.15, and
  current mainline as of this writing).  However, none of those kernels
  exhibit the panic in testing, only the Ubuntu 4.4 kernel after commit
  be2a20802abbde.

          The reason none of those kernels panic is they all have one thing
  in common: an incidental call to synchronize_rcu somewhere in the call
  path of CPU A after it is woken up.  This causes CPU A to wait for CPU B's
  activity, as CPU A will not go on to free the "tgt" memory until after the
  RCU grace period passes, which requires waiting for CPU B's activity to
  finish.  Note that the specific RCU sync call is different between some of
  those kernel versions, but all of them have one somewhere deep inside
  blk_cleanup_queue.  The bdi_unregister function has one (in the call to
  bdi_remove_from_list), which is why removing that call opens the race
  window on the Ubuntu 4.4 kernel.

          Resolving the issue can be accomplished by adding an RCU sync
  to virtscsi_target_destroy prior to freeing the target.  It is also possible
  to use a loop of the format:

  +       while (atomic_read(&tgt->reqs))
  +               cpu_relax();

          but this is higher risk, as the loop never terminates if some
  other failure prevents tgt->reqs from reaching zero.
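
  To illustrate the termination concern with the loop variant, here is a
  userspace sketch of a drain-before-free that gives up after a fixed
  number of polls.  It is not the proposed fix and not kernel code:
  drain_tgt, drain_tgt_alloc, drain_target_destroy and the max_spins
  bound are all assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-target state that holds tgt->reqs. */
struct drain_tgt {
	atomic_int reqs;  /* in-flight request count */
};

/* Allocate a model target with the given number of in-flight requests. */
static struct drain_tgt *drain_tgt_alloc(int inflight)
{
	struct drain_tgt *tgt = malloc(sizeof(*tgt));

	if (tgt)
		atomic_init(&tgt->reqs, inflight);
	return tgt;
}

/*
 * Wait for in-flight requests to drain, then free the target.
 * Unlike an unbounded loop, this gives up after max_spins polls.
 * Returns true if the target was freed, false if it was left
 * allocated because requests were still outstanding.
 */
static bool drain_target_destroy(struct drain_tgt *tgt, unsigned long max_spins)
{
	assert(tgt != NULL);

	while (atomic_load(&tgt->reqs) != 0) {
		if (max_spins-- == 0)
			return false;   /* caller must decide what to do next */
	}
	free(tgt);
	return true;
}
```

  The bound only turns a hang into a reported failure; the caller still
  has to choose between leaking the target and retrying later, which is
  part of why waiting for an RCU grace period is the simpler option.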

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1765241/+subscriptions