group.of.nepali.translators team mailing list archive

[Bug 1797990] Re: kdump fail due to an IRQ storm

 

This bug was fixed in the package linux - 4.15.0-42.45

---------------
linux (4.15.0-42.45) bionic; urgency=medium

  * linux: 4.15.0-42.45 -proposed tracker (LP: #1803592)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - KVM: s390: reset crypto attributes for all vcpus
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - s390/zcrypt: remove VLA usage from the AP bus
    - s390/zcrypt: Remove deprecated ioctls.
    - s390/zcrypt: Remove deprecated zcrypt proc interface.
    - s390/zcrypt: Support up to 256 crypto adapters.
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  *  CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

 -- Thadeu Lima de Souza Cascardo <cascardo@xxxxxxxxxxxxx>  Thu, 15 Nov 2018 17:01:46 -0200

** Changed in: linux (Ubuntu Bionic)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1797990

Title:
  kdump fail due to an IRQ storm

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  Won't Fix
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released

Bug description:
  [Impact]

   * A kexec/crash kernel might get stuck and fail to boot
     (for crash kernel, kdump fails to collect a crashdump)
     if a PCI device is buggy/stuck/looping and triggers a
     continuous flood of MSI(X) interrupts (that the kernel
     does not yet know about).

   * This fix made it possible to obtain crashdumps when
     debugging a heavy-load scenario in which a heavily
     loaded network adapter wouldn't stop triggering MSI-X
     interrupts even after panic()->kdump kicked in.

   * When enabled with a kernel cmdline parameter (disabled
     by default), this fix disables MSI(X) in all PCI devices
     on early boot (this is OK as it's (re-)enabled normally
     later).
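
  To illustrate what "disabling MSI(X)" means at the register level, the
  following is a minimal sketch in Python that walks a simulated PCI
  configuration space and clears the MSI and MSI-X enable bits. It is
  not the code from the patches (those work in early x86 PCI code and
  use find_cap()); the function names and the example layout built by
  example_config() are hypothetical. The offsets and bit positions are
  the standard ones from the PCI specification.

```python
import struct

PCI_STATUS = 0x06             # 16-bit status register
PCI_STATUS_CAP_LIST = 0x10    # status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34    # 8-bit pointer to the first capability
PCI_CAP_ID_MSI = 0x05
PCI_CAP_ID_MSIX = 0x11
MSI_FLAGS_ENABLE = 0x0001     # bit 0 of the MSI Message Control word
MSIX_FLAGS_ENABLE = 0x8000    # bit 15 of the MSI-X Message Control word


def read16(cfg, off):
    """Read a little-endian 16-bit register from the config space."""
    return struct.unpack_from("<H", cfg, off)[0]


def write16(cfg, off, val):
    """Write a little-endian 16-bit register to the config space."""
    struct.pack_into("<H", cfg, off, val)


def find_capability(cfg, cap_id):
    """Return the config-space offset of capability cap_id, or 0."""
    if not read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST:
        return 0
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    for _ in range(48):       # bound the walk in case the list loops
        if pos < 0x40:        # capabilities live above the standard header
            return 0
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return 0


def clear_msi_enable(cfg):
    """Clear MSI and MSI-X enable bits; return the cap IDs that were cleared."""
    cleared = []
    for cap_id, enable in ((PCI_CAP_ID_MSI, MSI_FLAGS_ENABLE),
                           (PCI_CAP_ID_MSIX, MSIX_FLAGS_ENABLE)):
        pos = find_capability(cfg, cap_id)
        if pos:
            ctrl = read16(cfg, pos + 2)   # Message Control follows ID/next
            if ctrl & enable:
                write16(cfg, pos + 2, ctrl & ~enable & 0xFFFF)
                cleared.append(cap_id)
    return cleared


def example_config():
    """Hypothetical device with both MSI and MSI-X left enabled."""
    cfg = bytearray(256)
    write16(cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x50
    cfg[0x50], cfg[0x51] = PCI_CAP_ID_MSI, 0x70    # MSI, next cap at 0x70
    write16(cfg, 0x52, 0x0081)                     # MSI enabled
    cfg[0x70], cfg[0x71] = PCI_CAP_ID_MSIX, 0x00   # MSI-X, end of list
    write16(cfg, 0x72, 0x803F)                     # MSI-X enabled
    return cfg
```

  Only the enable bits are touched; the rest of each Message Control
  word is left as found, which matches the note above that the driver
  can (re-)enable MSI(X) normally later.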

  [Test Case]

   * A synthetic test-case is not yet available; however,
     this particular system/workload triggered the problem
     consistently, and it was used for development/testing.

   * We'll update this bug once a synthetic test-case is
     available; we're working on patching QEMU for this.

   * $ cat /proc/cmdline
     <...> pci=clearmsi

     $ dmesg | grep 'Clearing MSI'
     [    0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)

   * The comparison of 'dmesg -t | sort' has been reviewed
     between option disabled/enabled on boot & kexec modes,
     and only expected differences were found (MHz, PIDs,
     MIPS).
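
  The two manual checks in the test case can be scripted. This is a
  minimal sketch in Python, assuming the parameter appears as one of
  the comma-separated options of "pci=" on the kernel command line and
  that the dmesg message is the one quoted above; the function names
  are hypothetical.

```python
def pci_options(cmdline):
    """Return the set of comma-separated options from all pci= parameters."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def clearmsi_requested(cmdline):
    """True if the kernel command line asks for the early MSI clearing."""
    return "clearmsi" in pci_options(cmdline)


def clearmsi_applied(dmesg_text):
    """True if the boot log contains the message printed by the quirk."""
    return "Clearing MSI/MSI-X enable bits early in boot" in dmesg_text
```

  On a running system the inputs would be the contents of /proc/cmdline
  and the output of dmesg, for example
  clearmsi_requested(open("/proc/cmdline").read()).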

  [Regression Potential]

   * The potential area for regressions is early boot,
     particularly effects of applying quirks during PCI
     bus scan, which is changed/broader w/ these patches.

   * However, all quirks are applied based on PCI ID
     matching, so would only apply if actually targeting
     a new device.

   * Moreover, the new quirk is only applied based on
     a kernel cmdline parameter that is disabled by
     default, which constrains even more when this
     is actually in effect.

  [Other Info]

   * The patch series is still under review/discussion
     upstream, but it's relatively important for Ubuntu
     users at this point, and after internal discussions
     we decided to submit it for SRU.

   * These are links to the linux-pci archive with the
     patches [1, 2, 3]

     [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
         https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@xxxxxxxxxxxxx/

     [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
         https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@xxxxxxxxxxxxx/

     [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
         https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@xxxxxxxxxxxxx/

  [Original Description]

  We have reports of a kdump failure in Ubuntu (on an x86 machine) that
  was narrowed down to an MSI IRQ storm coming from a PCI network device.

  The bug manifests as a lack of progress in the boot process of the
  kdump kernel, and a storm of kernel messages like:

  [...]
  [  342.265294] do_IRQ: 0.155 No irq handler for vector
  [  342.266916] do_IRQ: 0.155 No irq handler for vector
  [  347.258422] do_IRQ: 14053260 callbacks suppressed
  [...]
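
  A console log like the one above can be summarized with a short
  script. This is a minimal sketch in Python, assuming the usual
  do_IRQ message format in which the two numbers before "No irq
  handler" are the CPU and the vector, and that the "callbacks
  suppressed" lines carry the count of rate-limited messages; the
  function name is hypothetical.

```python
import re
from collections import Counter

NO_HANDLER = re.compile(r"do_IRQ: (\d+)\.(\d+) No irq handler for vector")
SUPPRESSED = re.compile(r"do_IRQ: (\d+) callbacks suppressed")


def summarize_irq_storm(log_text):
    """Count printed messages per (cpu, vector) and total suppressed ones."""
    printed = Counter()
    suppressed = 0
    for line in log_text.splitlines():
        m = NO_HANDLER.search(line)
        if m:
            printed[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = SUPPRESSED.search(line)
        if m:
            suppressed += int(m.group(1))
    return printed, suppressed
```

  A large suppressed count concentrated on a single vector is
  consistent with one device flooding the CPU with interrupts.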

  The root cause of the issue is that the kdump kernel kexec process
  does not ensure PCI devices are reset and/or MSI capabilities are
  disabled, so a PCI device could produce a huge number of PCI IRQs
  which would take all the processing time of the CPU (especially since
  we restrict the kdump kernel to use one single CPU only).

  This was tested using upstream kernel version 4.18, and the problem
  reproduces. In the specific test scenario, the PCI NIC was an "Intel
  82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI
  passthrough mode (vfio_pci), under high load on the guest.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+subscriptions