yahoo-eng-team team mailing list archive

[Bug 1444630] [NEW] nova-compute should stop handling virt lifecycle events when it's shutting down

 

Public bug reported:

This is a follow-on to bug 1293480 and is related to bug 1408176 and bug
1443186.

There can be a race when rebooting a compute host: libvirt shuts down the
guest VMs and sends STOPPED lifecycle events up to nova-compute, which then
tries to stop the instances via the stop API. That sometimes works and
sometimes doesn't. The compute service can go down leaving an instance with
a vm_state of ACTIVE and a task_state of powering-off, which isn't resolved
on host reboot.

Sometimes the stop API completes and the instance is stopped with
power_state=4 (shutdown) in the nova database. When the host comes back
up and libvirt restarts, it starts the guest VMs, which sends the
STARTED lifecycle event. Nova handles that event, but because the vm_state
in the nova database is STOPPED and the power_state reported by the
hypervisor is 1 (running), nova thinks the instance started up unexpectedly
and stops it:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0rc1#n6145

So nova shuts the running guest down.
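
To make the decision described above concrete, here is a minimal sketch of
that kind of check. It is not nova's actual code: the function name
reconcile() and the string return values are hypothetical, and only the
power_state numbers (1 = running, 4 = shutdown) and the vm_state names come
from this report.

```python
# Simplified illustration only; names other than the state values are
# hypothetical and do not correspond to nova's real implementation.
RUNNING = 1    # power_state reported by the hypervisor
SHUTDOWN = 4   # power_state stored after a completed stop

ACTIVE = "active"
STOPPED = "stopped"


def reconcile(db_vm_state, hypervisor_power_state):
    """Return the action a sync routine of this shape would take."""
    if db_vm_state == STOPPED and hypervisor_power_state == RUNNING:
        # The database says the instance was stopped, but the guest is
        # running, so it is treated as an unexpected start and stopped.
        return "stop"
    return "noop"
```

With this shape, a guest that libvirt restarts after a host reboot is stopped
again whenever the earlier stop call managed to record STOPPED in the
database.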

Actually the block in:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0rc1#n6145

conflicts with the statement in power_state.py:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/power_state.py?id=2015.1.0rc1#n19

"The hypervisor is always considered the authority on the status
of a particular VM, and the power_state in the DB should be viewed as a
snapshot of the VMs's state in the (recent) past."

Anyway, that's a different issue. The point is that when nova-compute is
shutting down, it should stop accepting lifecycle events from the
hypervisor (virt driver code), since it can't reliably act on them
anyway. Any sync-up that still needs to happen can be left to init_host() in
the compute manager.
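
One way to picture the proposal is a gate that forwards lifecycle events to
the compute manager until the service begins shutting down, then drops them.
The sketch below is only an illustration under that assumption:
LifecycleEventGate, close() and emit() are hypothetical names, not part of
nova or of any virt driver API.

```python
import threading


class LifecycleEventGate:
    """Hypothetical helper: forwards events until close() is called."""

    def __init__(self, handler):
        self._handler = handler      # callable that acts on an event
        self._accepting = True
        self._lock = threading.Lock()

    def close(self):
        """Stop accepting events, e.g. when the service starts shutting down."""
        with self._lock:
            self._accepting = False

    def emit(self, event):
        """Forward the event if still accepting; return whether it was handled."""
        with self._lock:
            if not self._accepting:
                return False         # dropped: the service is going down
        self._handler(event)
        return True
```

Events dropped this way are not lost for good, since the state can be
reconciled when the service starts again, as suggested for init_host().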

** Affects: nova
     Importance: Medium
     Assignee: Matt Riedemann (mriedem)
         Status: Triaged


** Tags: compute kilo-backport-potential libvirt

** Changed in: nova
       Status: New => Triaged

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
     Assignee: (unassigned) => Matt Riedemann (mriedem)

** Tags added: kilo-backport-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1444630
