yahoo-eng-team team mailing list archive
Message #31704
[Bug 1443186] [NEW] rebooted instances are shutdown by libvirt lifecycle event handling
Public bug reported:
This is a continuation of bug 1293480 (which created bug 1433049).
Those were reported against xen domains with the libvirt driver, but we
have a recreate with CONF.libvirt.virt_type=kvm; see the attached logs
and reference the instance with uuid
9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78.
In this case, we're running a stress test that soft reboots 30 active
instances at once. Because of a delay in the libvirt lifecycle event
handling, they are all shut down after the reboot operation is complete
and the instances go from ACTIVE to SHUTDOWN.
This was reported to me against Icehouse code, but the recreate is
against Juno code with patch https://review.openstack.org/#/c/169782/
applied for better logging.
Snippets from the log:
2015-04-10 21:02:38.234 11195 AUDIT nova.compute.manager [req-b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Rebooting instance
2015-04-10 21:03:47.703 11195 DEBUG nova.compute.manager [req-8219e6cf-dce8-44e7-a5c1-bf1879e155b2 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Received event network-vif-unplugged-0b2c7633-a5bc-4150-86b2-c8ba58ffa785 external_instance_event /usr/lib/python2.6/site-packages/nova/compute/manager.py:6285
2015-04-10 21:03:49.299 11195 INFO nova.virt.libvirt.driver [req-b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance shutdown successfully.
2015-04-10 21:03:53.251 11195 DEBUG nova.compute.manager [req-521a6bdb-172f-4c0c-9bef-855087d7dff0 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Received event network-vif-plugged-0b2c7633-a5bc-4150-86b2-c8ba58ffa785 external_instance_event /usr/lib/python2.6/site-packages/nova/compute/manager.py:6285
2015-04-10 21:03:53.259 11195 INFO nova.virt.libvirt.driver [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance running successfully.
2015-04-10 21:03:53.261 11195 INFO nova.virt.libvirt.driver [req-b24d4f8d-4a10-44c8-81d7-f79f27e3a3e7 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance soft rebooted successfully.
**
At this point we have successfully soft rebooted the instance
**
Now we get a lifecycle event from libvirt that the instance is stopped.
Since we're no longer running a task, we assume the hypervisor is
correct and call the stop API:
2015-04-10 21:04:01.133 11195 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1428699829.13, 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78 => Stopped> emit_event /usr/lib/python2.6/site-packages/nova/virt/driver.py:1298
2015-04-10 21:04:01.134 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] VM Stopped (Lifecycle Event)
2015-04-10 21:04:01.245 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: None, current DB power_state: 1, VM power_state: 4
2015-04-10 21:04:01.334 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2015-04-10 21:04:01.463 11195 WARNING nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
**
Now we get a lifecycle event from libvirt that the instance is started, but since the instance already has a task_state of 'powering-off' (from the previous stop API call made by _sync_instance_power_state), we ignore it.
**
2015-04-10 21:04:02.085 11195 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1428699831.45, 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78 => Started> emit_event /usr/lib/python2.6/site-packages/nova/virt/driver.py:1298
2015-04-10 21:04:02.086 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] VM Started (Lifecycle Event)
2015-04-10 21:04:02.190 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Synchronizing instance power state after lifecycle event "Started"; current vm_state: active, current task_state: powering-off, current DB power_state: 4, VM power_state: 1
2015-04-10 21:04:02.414 11195 INFO nova.compute.manager [-] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] During sync_power_state the instance has a pending task (powering-off). Skip.
2015-04-10 21:04:05.152 11195 DEBUG nova.compute.manager [req-682ce732-8a99-47d7-82c2-49ebdaea5332 None] [instance: 9ad8f6c5-a5dc-4820-9ea5-fa081e74ec78] Stopping instance; current vm_state: active, current task_state: powering-off, current DB power_state: 4, current VM power_state: 1 do_stop_instance /usr/lib/python2.6/site-packages/nova/compute/manager.py:2631
**
From here on out we shut down the instance, which is why it is left in the SHUTDOWN state.
**
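For readers not familiar with the sync path, the decision walked through above behaves roughly like the sketch below. This is a simplified paraphrase of the logic around _sync_instance_power_state, not Nova's actual code; helper names such as stop_api are illustrative only.

SHUTDOWN = 4  # power_state code seen in the logs above (1 == RUNNING)

def sync_instance_power_state(instance, vm_power_state, stop_api):
    # Paraphrase of the behaviour described above, not the real implementation.
    if instance.task_state is not None:
        # The later "Started" event lands here: task_state is already
        # 'powering-off' from the stop API call, so the event is skipped.
        return
    if instance.vm_state == 'active' and vm_power_state == SHUTDOWN:
        # No task is running, so the hypervisor is trusted over the DB:
        # the DB power_state is updated to match and the stop API is called
        # ("Instance shutdown by itself. Calling the stop API.").
        instance.power_state = vm_power_state
        instance.save()
        stop_api.stop(instance)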
--
There was a change made in kilo, https://review.openstack.org/#/c/112946/,
that delays emitting the stopped lifecycle event and waits for a started
lifecycle event, which cancels out the stopped event if it arrives
within a 15 second window. That change was only made for xen domains,
but as we're seeing, this isn't limited to xen domains; it also hits
kvm, and is therefore probably an issue for any libvirt domain in nova.
We made a change to include kvm in the delay logic:
if CONF.libvirt.virt_type in ("xen", "kvm"):
    self._lifecycle_delay = 15
else:
    self._lifecycle_delay = 0
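The delay logic itself behaves roughly like the following sketch: a Stopped event is held for _lifecycle_delay seconds and dropped if a Started event for the same instance arrives first. This is a paraphrase of the behaviour described in the kilo review, not the driver's actual code; the names pending_stops and emit_event are illustrative.

import threading

pending_stops = {}  # instance uuid -> threading.Timer holding a Stopped event

def handle_lifecycle_event(uuid, transition, lifecycle_delay, emit_event):
    if transition == 'Stopped' and lifecycle_delay > 0:
        # Hold the Stopped event for the delay window before emitting it.
        timer = threading.Timer(lifecycle_delay, emit_event,
                                args=(uuid, 'Stopped'))
        pending_stops[uuid] = timer
        timer.start()
    elif transition == 'Started':
        # A Started event inside the window cancels the pending Stopped
        # event, so the spurious stop seen during a soft reboot is dropped.
        timer = pending_stops.pop(uuid, None)
        if timer is not None:
            timer.cancel()
        emit_event(uuid, 'Started')
    else:
        emit_event(uuid, transition)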
After that, only 1 out of 372 instances hit the problem, and there was a
warning in the logs about the service-state reporting looping call
running behind its interval:
2015-04-12 17:13:31.330 30126 AUDIT nova.compute.manager [req-7d7976d8-6e66-4dd3-958f-f8ef2454dd8c None] [instance: 2189fcd6-8bed-4a1c-916a-1190b40ba568] Rebooting instance
2015-04-12 17:13:52.281 30126 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2015-04-12 17:14:02.296 30126 AUDIT nova.compute.manager [req-9dfd29ff-9bb0-4e10-a16d-4c8f86744176 None] [instance: 4a659b53-75ad-4410-bd91-d85dafe0175f] Rebooting instance
2015-04-12 17:14:28.122 30126 AUDIT nova.compute.resource_tracker [-] Total physical ram (MB): 129026, total allocated virtual ram (MB): 12800
2015-04-12 17:14:28.124 30126 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 4875
2015-04-12 17:14:28.125 30126 AUDIT nova.compute.resource_tracker [-] Total usable vcpus: 32, total allocated vcpus: 6
2015-04-12 17:14:28.126 30126 AUDIT nova.compute.resource_tracker [-] PCI stats: []
2015-04-12 17:14:38.036 30126 AUDIT nova.compute.manager [req-aa25e5f9-5083-46b8-bf29-cc16c2c44c75 None] [instance: 81688c40-3eeb-4d6c-85b3-b2b7dd66eb81] Rebooting instance
2015-04-12 17:14:59.883 30126 WARNING nova.openstack.common.loopingcall [-] task <bound method DbDriver._report_state of <nova.servicegroup.drivers.db.DbDriver object at 0x3401750>> run outlasted interval by 14.99 sec
2015-04-12 17:15:02.486 30126 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2015-04-12 17:15:05.019 30126 WARNING nova.compute.manager [req-c17f5069-7540-42be-b7cf-718bf551df66 None] [instance: 81688c40-3eeb-4d6c-85b3-b2b7dd66eb81] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, current DB power_state: 4, current VM power_state: 4
2015-04-12 17:15:20.602 30126 AUDIT nova.compute.manager [req-34a41e3b-3460-4170-b06e-6e506efc5d69 None] [instance: c9ba059c-e0a2-43e8-a7de-e3710565337f] Rebooting instance
2015-04-12 17:15:40.152 30126 AUDIT nova.compute.manager [req-2d790be7-cc9b-4256-859d-e0b521459f5d None] [instance: e996f312-4dd3-456a-9747-972ff674e9fe] Rebooting instance
2015-04-12 17:15:45.047 30126 AUDIT nova.compute.manager [req-b4ab6799-7d15-471b-b76f-cc4e0e35b359 None] [instance: ef46f5c4-daef-4b42-b5c9-1764cfead244] Rebooting instance
I'll post a change to remove the xen-specific filtering in the lifecycle
delay processing. I also already have a change posted to add a config
option to disable the lifecycle event handling entirely, since it's racy
and not tested at scale in the gate, so operators should be able to turn
it off: https://review.openstack.org/#/c/159275/
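As a rough illustration of that opt-out, the option could look like the oslo.config sketch below. The option name, group, and help text here are placeholders, not necessarily what the linked review lands as.

from oslo_config import cfg

# Hypothetical option sketching the opt-out described above; illustrative only.
lifecycle_opts = [
    cfg.BoolOpt('handle_virt_lifecycle_events',
                default=True,
                help='Whether to handle power-state lifecycle events raised '
                     'by the virt driver. Operators can set this to False to '
                     'rely only on periodic power-state syncing.'),
]

CONF = cfg.CONF
CONF.register_opts(lifecycle_opts, group='workarounds')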
** Affects: nova
Importance: Medium
Assignee: Matt Riedemann (mriedem)
Status: Triaged
** Tags: compute kilo-backport-potential libvirt
** Attachment added: "logforreboot.tgz"
https://bugs.launchpad.net/bugs/1443186/+attachment/4373493/+files/logforreboot.tgz
** Tags added: kilo-backport-potential
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1443186
Title:
rebooted instances are shutdown by libvirt lifecycle event handling
Status in OpenStack Compute (Nova):
Triaged
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1443186/+subscriptions