
kernel-packages team mailing list archive

[Bug 1581243] [NEW] [Hyper-V] PCI Passthrough kernel hang and explicit barriers

 

Public bug reported:

Two upstream commits (right now in Bjorn Helgaas's PCI tree, and heading
to Linus's tree) address potential hangs in PCI passthrough. Please
consider these upstream items for 16.10 and 16.04 (and HWE kernels based
on lts-xenial).

https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/host-hv&id=deb22e5c84c884a129d801cf3bfde7411536998d

PCI: hv: Report resources release after stopping the bus
A kernel hang is observed when the pci-hyperv module is released with
device drivers still attached.  E.g., when I do 'rmmod pci_hyperv' with a
BCM5720 device passed through (tg3 module) I see the following:

 NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [rmmod:2104]
 ...
 Call Trace:
  [<ffffffffa0641487>] tg3_read_mem+0x87/0x100 [tg3]
  [<ffffffffa063f000>] ? 0xffffffffa063f000
  [<ffffffffa0644375>] tg3_poll_fw+0x85/0x150 [tg3]
  [<ffffffffa0649877>] tg3_chip_reset+0x357/0x8c0 [tg3]
  [<ffffffffa064ca8b>] tg3_halt+0x3b/0x190 [tg3]
  [<ffffffffa0657611>] tg3_stop+0x171/0x230 [tg3]
  ...
  [<ffffffffa064c550>] tg3_remove_one+0x90/0x140 [tg3]
  [<ffffffff813bee59>] pci_device_remove+0x39/0xc0
  [<ffffffff814a3201>] __device_release_driver+0xa1/0x160
  [<ffffffff814a32e3>] device_release_driver+0x23/0x30
  [<ffffffff813b794a>] pci_stop_bus_device+0x8a/0xa0
  [<ffffffff813b7ab6>] pci_stop_root_bus+0x36/0x60
  [<ffffffffa02c3f38>] hv_pci_remove+0x238/0x260 [pci_hyperv]

The problem seems to be that we report the release of local resources
before stopping the bus and removing devices from it, so device drivers may
still try to operate on these resources during shutdown.  Move the resource
release report to after the call to pci_stop_root_bus().

Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Acked-by: Jake Oshins <jakeo@xxxxxxxxxxxxx>
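
To illustrate the ordering problem described in the commit above, here is
a small userspace model in C.  It is not the pci-hyperv code: struct
fake_bus and every fake_*/teardown_* function are hypothetical names
invented for this sketch, and a counter stands in for the soft lockup.  It
only shows why reporting the release before stopping the bus lets a
driver's remove path touch resources that are already gone.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a passed-through bus and its host-side resources. */
struct fake_bus {
    bool resources_held;   /* true while the host still backs the device */
    bool driver_bound;     /* true while a device driver is attached */
    int  failed_accesses;  /* device accesses attempted after the release */
};

/* Models a driver remove path that still talks to the device,
 * as tg3 does via tg3_halt() in the trace above. */
static void fake_driver_remove(struct fake_bus *bus)
{
    if (!bus->resources_held)
        bus->failed_accesses++;   /* in the report, this is where the CPU gets stuck */
    bus->driver_bound = false;
}

/* Models stopping the bus: every bound driver gets detached. */
static void fake_stop_bus(struct fake_bus *bus)
{
    if (bus->driver_bound)
        fake_driver_remove(bus);
}

/* Models telling the host that the resources are no longer in use. */
static void fake_report_release(struct fake_bus *bus)
{
    bus->resources_held = false;
}

/* Ordering described as the problem: release is reported first. */
int teardown_release_first(void)
{
    struct fake_bus bus = { .resources_held = true, .driver_bound = true };

    fake_report_release(&bus);
    fake_stop_bus(&bus);
    return bus.failed_accesses;
}

/* Ordering described as the fix: stop the bus, then report the release. */
int teardown_stop_first(void)
{
    struct fake_bus bus = { .resources_held = true, .driver_bound = true };

    fake_stop_bus(&bus);
    fake_report_release(&bus);
    return bus.failed_accesses;
}
```

In this model teardown_release_first() counts one access made after the
release, while teardown_stop_first() counts none.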

https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/commit/?h=pci/host-hv&id=bdd74440d9e887b1fa648eefa17421def5f5243c

PCI: hv: Add explicit barriers to config space access
I'm trying to pass through a Broadcom BCM5720 NIC (Dell device 1f5b) on a
Dell R720 server.  Everything works fine when the target VM has only one
CPU, but SMP guests reboot when the NIC driver accesses PCI config space
with hv_pcifront_read_config()/hv_pcifront_write_config().  The reboot
appears to be induced by the hypervisor and no crash is observed.  Windows
event logs are not helpful at all ('Virtual machine ... has quit
unexpectedly').  The particular access point is always different, and
putting debug statements (printk/mdelay/...) between the accesses moves
the issue further away.  The server model affects the issue as well: on a
Dell R420 I'm able to pass through a BCM5720 NIC to SMP guests without
issues.

While I'm obviously failing to reveal the essence of the issue I was able
to come up with a (possible) solution: if explicit barriers are added to
hv_pcifront_read_config()/hv_pcifront_write_config() the issue goes away.
The essential minimum is rmb() at the end of _hv_pcifront_read_config() and
wmb() at the end of _hv_pcifront_write_config() but I'm not confident it
will be sufficient for all hardware.  I suggest the following barriers:

1) wmb()/mb() between choosing the function and writing to its space.
2) mb() before releasing the spinlock in both _hv_pcifront_read_config()/
   _hv_pcifront_write_config() to ensure that consecutive reads/writes to
   the space won't get re-ordered as drivers may count on that.

Config space access is not supposed to be performance-critical so these
explicit barriers should not cause any slowdown.

[bhelgaas: use Linux "barriers" terminology]
Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Acked-by: Jake Oshins <jakeo@xxxxxxxxxxxxx>
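
The barrier placement suggested in the commit above can be sketched in
userspace C.  This is only a model under stated assumptions, not the
driver's implementation: struct fake_cfg_window and the fake_* functions
are hypothetical, the "device" is a plain array, a C11 atomic_flag stands
in for the kernel spinlock, and atomic_thread_fence() stands in for
wmb()/mb().  C11 fences are not equivalent to the kernel barriers for real
MMIO; the sketch only shows where the two barriers sit relative to the
function select, the access, and the unlock.

```c
#include <stdatomic.h>
#include <stdint.h>

#define FAKE_FUNCS 2   /* hypothetical number of functions behind the window */
#define FAKE_REGS  4   /* hypothetical number of registers per function */

/* Hypothetical config window: a select register plus per-function space. */
struct fake_cfg_window {
    atomic_flag lock;                          /* stand-in for the spinlock */
    volatile uint32_t select;                  /* which function is mapped */
    volatile uint32_t space[FAKE_FUNCS][FAKE_REGS];
};

static void fake_lock(struct fake_cfg_window *w)
{
    while (atomic_flag_test_and_set_explicit(&w->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void fake_unlock(struct fake_cfg_window *w)
{
    atomic_flag_clear_explicit(&w->lock, memory_order_release);
}

void fake_cfg_write(struct fake_cfg_window *w, uint32_t fn, uint32_t reg,
                    uint32_t val)
{
    fake_lock(w);
    w->select = fn;                              /* choose the function */
    atomic_thread_fence(memory_order_seq_cst);   /* barrier 1: select before access */
    w->space[w->select][reg] = val;              /* write to its space */
    atomic_thread_fence(memory_order_seq_cst);   /* barrier 2: before the unlock */
    fake_unlock(w);
}

uint32_t fake_cfg_read(struct fake_cfg_window *w, uint32_t fn, uint32_t reg)
{
    uint32_t val;

    fake_lock(w);
    w->select = fn;                              /* choose the function */
    atomic_thread_fence(memory_order_seq_cst);   /* barrier 1: select before access */
    val = w->space[w->select][reg];              /* read from its space */
    atomic_thread_fence(memory_order_seq_cst);   /* barrier 2: before the unlock */
    fake_unlock(w);
    return val;
}

/* Writes one register of function 1 and reads it back. */
uint32_t fake_cfg_roundtrip(void)
{
    static struct fake_cfg_window w = { .lock = ATOMIC_FLAG_INIT };

    fake_cfg_write(&w, 1, 2, 0xabcd);
    return fake_cfg_read(&w, 1, 2);
}
```

A single-threaded run cannot reproduce the reported reboot; the roundtrip
helper only checks that the model reads back what it wrote.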

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1581243


