yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #03859
[Bug 1180501] Re: handling EVENT_LIFECYCLE_STOPPED races with removing instance from database
** Changed in: nova
Status: Fix Committed => Fix Released
** Changed in: nova
Milestone: None => havana-2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1180501
Title:
handling EVENT_LIFECYCLE_STOPPED races with removing instance from
database
Status in OpenStack Compute (Nova):
Fix Released
Bug description:
When an instance is deleted (ComputeManager.terminate_instance), the
compute driver is told to destroy the instance and wait for its
destruction, then the instance is removed from the database. Using the
libvirt+kvm compute driver on stable/grizzly and libvirt-1.0.4, when
the instance is destroyed, an EVENT_LIFECYCLE_STOPPED is
asynchronously delivered to nova-compute by libvirt. If the
EVENT_LIFECYCLE_STOPPED is handled after the instance has been removed
from the database, then CopmuteManager's lifecycle event handler bombs
when it tries to look the instance up in the database (see stack trace
below).
The error is benign since the lifecycle event handler just syncs the
power state of instances and this instance already has the correct
power state (i.e., not in the database at all). However, the stack
trace is scary and should probably be fixed.
The more stuff happening in parallel makes the bug easier to observe:
* It seems easiest to reproduce this bug when lots (10+) of instances are being created and deleted in parallel.
* I'm fairly certain this race exists regardless of libvirt version (i.e., asynchronous delivery of an event, handling of which queries a potentially deleted record), however I haven't been able to reproduce it with an older version (libvirt-0.9.8). This is probably explained by libvirt-1.0.3+ having a bunch of bottlenecks removed (see Daniel Berrange's patch series on libvirt-devel).
* I'm working on concurrency bottlenecks in OpenStack. With a bunch of these fixed, the bug happens almost all of the time.
I haven't tested this on the master branch, but the faulting bits
(ComputeManager.handle_lifecycle_event) haven't changed.
This problem could be fixed in a couple of ways:
1) Wait for the EVENT_LIFECYCLE_STOPPED in LibvirtDriver.destroy rather than polling libvirt (see _wait_for_destroy).
2) Ignore events in ComputeManager.handle_lifecycle_event for non-existant instances.
3) maintain a deleted list (i.e., list of instance uuids for which events can be ignored).
I think (2) is simplest to implement and most robust. Moreover, (2)
will handle other the race between any event delivery and the deletion
of a domain. Although (3) has the advantage that it won't mask
database deletion bugs, it introduces a garbage collection problem.
Relevant blueprint: https://blueprints.launchpad.net/nova/+spec
/compute-driver-events
InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/virt/driver.py", line 902, in emit_event
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self._compute_event_callback(event)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/compute/manager.py", line 577, in handle_events
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self.handle_lifecycle_event(event)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/compute/manager.py", line 556, in handle_lifecycle_event
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver context, event.get_instance_uuid())
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/api.py", line 402, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver instance_uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/rpcapi.py", line 122, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return self.call(context, msg, version='1.2')
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/proxy.py", line 83, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return rpc.call(context, self._get_topic(topic), msg, timeout)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 95, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return fn(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 162, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return _get_impl().call(CONF, context, topic, msg, timeout)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/impl_kombu.py", line 798, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver rpc_amqp.get_connection_pool(conf, Connection))
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 612, in call
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver rv = list(rv)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 561, in __iter__
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver raise result
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound_Remote: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 407, in catch_client_exception
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return func(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/conductor/manager.py", line 107, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver self.db.instance_get_by_uuid(context, instance_uuid))
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/api.py", line 598, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return IMPL.instance_get_by_uuid(context, uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/openstack/common/trace/__init__.py", line 258, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver r = fn(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return f(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1531, in instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return _instance_get_by_uuid(context, uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver return f(*args, **kwargs)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1541, in _instance_get_by_uuid
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver raise exception.InstanceNotFound(instance_id=uuid)
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
2013-05-15 11:03:05.742 3270 DEBUG nova.virt.driver [-] Emitting event <nova.virt.event.LifecycleEvent object at 0x4029c50> emit_event /opt/stack/nova/nova/virt/driver.py:901
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1180501/+subscriptions