
[Bug 1443186] Re: rebooted instances are shutdown by libvirt lifecycle event handling

 

** Changed in: nova
       Status: Fix Committed => Fix Released

** Changed in: nova
    Milestone: None => liberty-1

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1443186

Title:
  rebooted instances are shutdown by libvirt lifecycle event handling

Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  This is a continuation of bug 1293480 (which created bug 1433049).
  Those were reported against xen domains with the libvirt driver but we
  have a recreate with CONF.libvirt.virt_type=kvm, see the attached logs
  and reference the instance with uuid
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78.

  In this case, we're running a stress test that soft reboots 30 active
  instances at once.  Because of a delay in the libvirt lifecycle event
  handling, they are all shut down after the reboot operation completes,
  and the instances go from ACTIVE to SHUTDOWN.
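
  For reference, the stress test boils down to issuing a soft reboot for
  every active instance on the host at roughly the same time.  A minimal
  reproduction sketch along those lines, assuming Juno-era
  python-novaclient and the usual OS_* environment credentials (the
  script and its names are illustrative, not the actual test harness):

      # hypothetical reproduction driver, not the actual stress test
      import os
      from novaclient import client

      nova = client.Client('2',
                           os.environ['OS_USERNAME'],
                           os.environ['OS_PASSWORD'],
                           os.environ['OS_TENANT_NAME'],
                           os.environ['OS_AUTH_URL'])

      # soft reboot every ACTIVE instance at once, then watch for servers
      # that unexpectedly end up SHUTOFF after the reboots complete
      for server in nova.servers.list():
          if server.status == 'ACTIVE':
              server.reboot(reboot_type='SOFT')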

  This was reported to me against Icehouse code, but the recreate is
  against Juno code with the following patch applied for better logging:

  https://review.openstack.org/#/c/169782/

  Snippets from the log:

  2015-04-10 21:02:38.234 11195 AUDIT nova.compute.manager [req-
  b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance:
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Rebooting instance

  2015-04-10 21:03:47.703 11195 DEBUG nova.compute.manager [req-
  8219e6cf-dce8-44e7-a5c1-bf1879e155b2 None] [instance:
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Received event network-vif-
  unplugged-0b2c7633-a5bc-4150-86b2-c8ba58ffa785 external_instance_event
  /usr/lib/python2.6/site-packages/nova/compute/manager.py:6285

  2015-04-10 21:03:49.299 11195 INFO nova.virt.libvirt.driver [req-
  b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance:
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance shutdown successfully.

  2015-04-10 21:03:53.251 11195 DEBUG nova.compute.manager [req-
  521a6bdb-172f-4c0c-9bef-855087d7dff0 None] [instance:
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Received event network-vif-
  plugged-0b2c7633-a5bc-4150-86b2-c8ba58ffa785 external_instance_event
  /usr/lib/python2.6/site-packages/nova/compute/manager.py:6285

  2015-04-10 21:03:53.259 11195 INFO nova.virt.libvirt.driver [-]
  [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance running
  successfully.

  2015-04-10 21:03:53.261 11195 INFO nova.virt.libvirt.driver [req-
  b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance:
  9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance soft rebooted
  successfully.

  **
  At this point we have successfully soft rebooted the instance
  **

  Now we get a lifecycle event from libvirt saying the instance is
  stopped; since the instance no longer has a task running, we assume the
  hypervisor is correct and we call the stop API.

  2015-04-10 21:04:01.133 11195 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1428699829.13, 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78 => Stopped> emit_event /usr/lib/python2.6/site-packages/nova/virt/driver.py:1298
  2015-04-10 21:04:01.134 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] VM Stopped (Lifecycle Event)
  2015-04-10 21:04:01.245 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: None, current DB power_state: 1, VM power_state: 4
  2015-04-10 21:04:01.334 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
  2015-04-10 21:04:01.463 11195 WARNING nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4

  **
  Now we get a lifecycle event from libvirt saying the instance is started, but since the instance already has a task_state of 'powering-off' from the previous stop API call made in _sync_instance_power_state, we ignore it.
  **

  
  2015-04-10 21:04:02.085 11195 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1428699831.45, 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78 => Started> emit_event /usr/lib/python2.6/site-packages/nova/virt/driver.py:1298
  2015-04-10 21:04:02.086 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] VM Started (Lifecycle Event)
  2015-04-10 21:04:02.190 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Synchronizing instance power state after lifecycle event "Started"; current vm_state: active, current task_state: powering-off, current DB power_state: 4, VM power_state: 1
  2015-04-10 21:04:02.414 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] During sync_power_state the instance has a pending task (powering-off). Skip.
  2015-04-10 21:04:05.152 11195 DEBUG nova.compute.manager [req-682ce732-8a99-47d7-82c2-49ebdaea5332 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Stopping instance; current vm_state: active, current task_state: powering-off, current DB power_state: 4, current VM power_state: 1 do_stop_instance /usr/lib/python2.6/site-packages/nova/compute/manager.py:2631

  **
  From here on out we'd shut down the instance, and that's why it's left in this state.
  **
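
  The decision being made in those two annotated steps is, roughly, the
  power state sync that runs for each lifecycle event.  A simplified,
  self-contained sketch of that decision (illustrative only; the real
  logic lives in _sync_instance_power_state in nova/compute/manager.py):

      # simplified sketch of the power-state sync decision; not the
      # actual nova/compute/manager.py code
      RUNNING, SHUTDOWN = 1, 4   # power_state codes as seen in the logs

      def sync_power_state(instance, vm_power_state, stop_api):
          # instance: object with vm_state/task_state/power_state fields
          # stop_api: callable that issues the stop API request
          if instance.task_state is not None:
              # e.g. 'powering-off' set by an earlier stop call: another
              # operation owns the instance, so skip the sync (this is
              # why the later "Started" event is ignored)
              return
          if instance.vm_state == 'active' and vm_power_state == SHUTDOWN:
              # no task is running, so the hypervisor is trusted: record
              # the new power_state and call the stop API, which sets
              # task_state to 'powering-off' -- even though the soft
              # reboot has already finished successfully
              instance.power_state = vm_power_state
              stop_api(instance)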

  --

  There was a change made in kilo
  https://review.openstack.org/#/c/112946/ that delays sending the
  stopped lifecycle event, waiting for a started lifecycle event which
  will cancel out the stopped event if it arrives within a 15 second
  window.  That change was only made for xen domains, but as we're
  seeing, this isn't just xen domains, it's also kvm, and it is therefore
  probably an issue for any libvirt domain in nova.

  We made a change to include kvm in the delay logic:

          if CONF.libvirt.virt_type in ("xen", "kvm"):
              self._lifecycle_delay = 15
          else:
              self._lifecycle_delay = 0
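
  For context, the idea behind that delay is to hold on to a Stopped
  lifecycle event and throw it away if a matching Started event for the
  same instance arrives inside the delay window, so a reboot's
  stop/start pair cancels out.  A rough, self-contained sketch of that
  idea, using threading.Timer in place of the eventlet timers the libvirt
  driver actually uses, with illustrative names:

      # illustrative sketch of the delayed-emit idea; not the actual
      # nova libvirt driver code
      import threading

      class DelayedLifecycleEmitter(object):
          def __init__(self, emit, delay=15):
              self._emit = emit      # callback that delivers the event
              self._delay = delay    # lifecycle delay window in seconds
              self._pending = {}     # uuid -> Timer for held Stopped events

          def handle(self, uuid, event):
              if event == 'Stopped':
                  # hold the Stopped event; only emit it if no Started
                  # event for this instance shows up within the window
                  timer = threading.Timer(self._delay, self._emit,
                                          (uuid, event))
                  self._pending[uuid] = timer
                  timer.start()
              elif event == 'Started':
                  pending = self._pending.pop(uuid, None)
                  if pending is not None:
                      # the instance was rebooting: drop the held Stopped
                      # event so the compute manager never sees the
                      # spurious stop
                      pending.cancel()
                  self._emit(uuid, event)
              else:
                  self._emit(uuid, event)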

  After that, only 1 out of 372 instances hit the problem and there was
  a warning in the logs about a looping call reporting state:

  2015-04-12 17:13:31.330 30126 AUDIT nova.compute.manager [req-7d7976d8-6e66-4dd3-958f-f8ef2454dd8c None] [instance: 2189fcd6-8bed-4a1c-916a-1190b40ba568] Rebooting instance
  2015-04-12 17:13:52.281 30126 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
  2015-04-12 17:14:02.296 30126 AUDIT nova.compute.manager [req-9dfd29ff-9bb0-4e10-a16d-4c8f86744176 None] [instance: 4a659b53-75ad-4410-bd91-d85dafe0175f] Rebooting instance
  2015-04-12 17:14:28.122 30126 AUDIT nova.compute.resource_tracker [-] Total physical ram (MB): 129026, total allocated virtual ram (MB): 12800
  2015-04-12 17:14:28.124 30126 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 4875
  2015-04-12 17:14:28.125 30126 AUDIT nova.compute.resource_tracker [-] Total usable vcpus: 32, total allocated vcpus: 6
  2015-04-12 17:14:28.126 30126 AUDIT nova.compute.resource_tracker [-] PCI stats: []
  2015-04-12 17:14:38.036 30126 AUDIT nova.compute.manager [req-aa25e5f9-5083-46b8-bf29-cc16c2c44c75 None] [instance: 81688c40-3eeb-4d6c-85b3-b2b7dd66eb81] Rebooting instance
  2015-04-12 17:14:59.883 30126 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x3401750>> run outlasted interval by 14.99 sec
  2015-04-12 17:15:02.486 30126 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
  2015-04-12 17:15:05.019 30126 WARNING nova.compute.manager [req-c17f5069-7540-42be-b7cf-718bf551df66 None] [instance: 81688c40-3eeb-4d6c-85b3-b2b7dd66eb81] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, current DB power_state: 4, current VM power_state: 4
  2015-04-12 17:15:20.602 30126 AUDIT nova.compute.manager [req-34a41e3b-3460-4170-b06e-6e506efc5d69 None] [instance: c9ba059c-e0a2-43e8-a7de-e3710565337f] Rebooting instance
  2015-04-12 17:15:40.152 30126 AUDIT nova.compute.manager [req-2d790be7-cc9b-4256-859d-e0b521459f5d None] [instance: e996f312-4dd3-456a-9747-972ff674e9fe] Rebooting instance
  2015-04-12 17:15:45.047 30126 AUDIT nova.compute.manager [req-b4ab6799-7d15-471b-b76f-cc4e0e35b359 None] [instance: ef46f5c4-daef-4b42-b5c9-1764cfead244] Rebooting instance


  I'll post a change to remove the xen-specific filtering in the
  lifecycle delay processing.  I also already have a change posted to add
  a config option to disable the lifecycle event handling entirely, since
  it's racy and not tested at scale in the gate, so operators should be
  able to turn it off:

  https://review.openstack.org/#/c/159275/
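
  If that change merges as proposed, operators would be able to switch
  the handling off in nova.conf with something along these lines (the
  option name and group are assumed from the proposed review and may
  change before merge):

      [workarounds]
      # assumed option from the proposed change; check the merged patch
      # for the final name and default
      handle_virt_lifecycle_events = False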

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1443186/+subscriptions

