yahoo-eng-team team mailing list archive

[Bug 1853259] [NEW] performance gaps on detect crashed instance

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: "ya.wang" <1853259@xxxxxxxxxxxxxxxxxx>
Date: Wed, 20 Nov 2019 08:55:41 -0000
Reply-to: Bug 1853259 <1853259@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Description
===========
If a QEMU process crashes (OOM, etc.), libvirt sends an event saying that the instance stopped, and the event detail says that it stopped because of a failure. But nova handles only the stopped event; it does not check the detail.
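
To illustrate the missing check, here is a minimal sketch of a helper that looks at both the event type and its detail. The names mirror libvirt's VIR_DOMAIN_EVENT_STOPPED and VIR_DOMAIN_EVENT_STOPPED_FAILED, but the numeric values below are assumptions written out only to keep the example self-contained, and is_crash_stop is a hypothetical helper, not a nova function. Real code should take the constants from the libvirt Python module.

```python
# Assumed values, defined locally so the sketch runs without libvirt installed.
# In real code, use libvirt.VIR_DOMAIN_EVENT_STOPPED and friends instead.
VIR_DOMAIN_EVENT_STOPPED = 5            # lifecycle event type: domain stopped
VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN = 0   # detail: normal guest shutdown
VIR_DOMAIN_EVENT_STOPPED_FAILED = 5     # detail: stopped because of a failure


def is_crash_stop(event: int, detail: int) -> bool:
    """Return True only for a stopped event whose detail reports a failure."""
    return (event == VIR_DOMAIN_EVENT_STOPPED
            and detail == VIR_DOMAIN_EVENT_STOPPED_FAILED)
```

With a check like this, a handler could tell a crashed instance apart from an ordinary shutdown or reboot as soon as the event arrives.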

When the event handler receives a stopped event, it sleeps for 15 seconds to make sure the event was not sent by a reboot operation.
https://github.com/openstack/nova/blob/stable/train/nova/virt/libvirt/host.py#L352

As a result, nova takes a long time to detect the crashed instance.
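
One possible way to close the gap, sketched under assumptions: emit a stopped event immediately when its detail reports a failure, and keep the 15 second delay for every other stopped event. StoppedEvent, dispatch_stopped, and the injected emit and spawn_after callables are hypothetical names for this example and are not taken from nova's code.

```python
from dataclasses import dataclass
from typing import Any, Callable

LIFECYCLE_DELAY_SECONDS = 15  # delay described in this report


@dataclass
class StoppedEvent:
    uuid: str     # instance UUID
    failed: bool  # True when the event detail reports a failure


def dispatch_stopped(event: StoppedEvent,
                     emit: Callable[[StoppedEvent], Any],
                     spawn_after: Callable[..., Any],
                     delay: float = LIFECYCLE_DELAY_SECONDS) -> float:
    """Emit a failed stop at once; delay any other stop.

    Returns the delay, in seconds, that was applied to the event.
    """
    if event.failed:
        # A crash cannot be part of a reboot, so there is nothing to wait for.
        emit(event)
        return 0
    # The stop may be followed by a start (reboot), so defer the emit.
    spawn_after(delay, emit, event)
    return delay
```

The scheduler is passed in as spawn_after so that the sketch does not depend on any particular threading or green thread library.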

Steps to reproduce
==================
1. Launch a VM
2. Log in to the compute node, find the corresponding QEMU process, and kill it:
   "kill -SIGBUS pid"
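
The kill step can also be scripted. The sketch below sends SIGBUS from Python, which is equivalent to the kill command above on a POSIX system. crash_process and demo_on_child are hypothetical helpers for this example; finding the PID of the instance's QEMU process is not shown, and the demo crashes a throwaway child process rather than a real VM.

```python
import os
import signal
import subprocess
import sys


def crash_process(pid: int) -> None:
    """Send SIGBUS to the given PID, like `kill -SIGBUS pid`."""
    os.kill(pid, signal.SIGBUS)


def demo_on_child() -> int:
    """Start a throwaway child process, crash it, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    crash_process(child.pid)
    # On POSIX, a negative return code -N means the child died from signal N.
    return child.wait()
```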

Expected result
===============
The nova service can detect the crash event within a second.

Actual result
=============
Nova needs more than 10 seconds to handle the event.

Environment
===========
1. OpenStack cluster version
master build 2019.11.11 (all-in-one)

2. Hypervisor
Libvirt + KVM

3. Storage type
Ceph

4. Networking type
Neutron with OVS

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1853259

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1853259/+subscriptions