← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1180501] Re: handling EVENT_LIFECYCLE_STOPPED races with removing instance from database

 

** Changed in: nova
       Status: Fix Committed => Fix Released

** Changed in: nova
    Milestone: None => havana-2

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1180501

Title:
  handling EVENT_LIFECYCLE_STOPPED races with removing instance from
  database

Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  When an instance is deleted (ComputeManager.terminate_instance), the
  compute driver is told to destroy the instance and wait for its
  destruction, then the instance is removed from the database. Using the
  libvirt+kvm compute driver on stable/grizzly and libvirt-1.0.4, when
  the instance is destroyed, an EVENT_LIFECYCLE_STOPPED is
  asynchronously delivered to nova-compute by libvirt. If the
  EVENT_LIFECYCLE_STOPPED is handled after the instance has been removed
  from the database, then CopmuteManager's lifecycle event handler bombs
  when it tries to look the instance up in the database (see stack trace
  below).

  The error is benign since the lifecycle event handler just syncs the
  power state of instances and this instance already has the correct
  power state (i.e., not in the database at all). However, the stack
  trace is scary and should probably be fixed.

  The more stuff happening in parallel makes the bug easier to observe:
  * It seems easiest to reproduce this bug when lots (10+) of instances are being created and deleted in parallel.
  * I'm fairly certain this race exists regardless of libvirt version (i.e., asynchronous delivery of an event, handling of which queries a potentially deleted record), however I haven't been able to reproduce it with an older version (libvirt-0.9.8). This is probably explained by libvirt-1.0.3+ having a bunch of bottlenecks removed (see Daniel Berrange's patch series on libvirt-devel).
  * I'm working on concurrency bottlenecks in OpenStack. With a bunch of these fixed, the bug happens almost all of the time.

  I haven't tested this on the master branch, but the faulting bits
  (ComputeManager.handle_lifecycle_event) haven't changed.

  This problem could be fixed in a couple of ways:

  1) Wait for the EVENT_LIFECYCLE_STOPPED in LibvirtDriver.destroy rather than polling libvirt (see _wait_for_destroy).
  2) Ignore events in ComputeManager.handle_lifecycle_event for non-existant instances.
  3) maintain a deleted list (i.e., list of instance uuids for which events can be ignored).

  I think (2) is simplest  to implement and most robust. Moreover, (2)
  will handle other the race between any event delivery and the deletion
  of a domain. Although (3) has the advantage that it won't mask
  database deletion bugs, it introduces a garbage collection problem.

  Relevant blueprint: https://blueprints.launchpad.net/nova/+spec
  /compute-driver-events

  InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/virt/driver.py", line 902, in emit_event
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     self._compute_event_callback(event)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/compute/manager.py", line 577, in handle_events
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     self.handle_lifecycle_event(event)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/compute/manager.py", line 556, in handle_lifecycle_event
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     context, event.get_instance_uuid())
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/conductor/api.py", line 402, in instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     instance_uuid)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/conductor/rpcapi.py", line 122, in instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return self.call(context, msg, version='1.2')
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/proxy.py", line 83, in call
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return rpc.call(context, self._get_topic(topic), msg, timeout)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 95, in wrapper
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return fn(*args, **kwargs)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/__init__.py", line 162, in call
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return _get_impl().call(CONF, context, topic, msg, timeout)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/impl_kombu.py", line 798, in call
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     rpc_amqp.get_connection_pool(conf, Connection))
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 612, in call
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     rv = list(rv)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 561, in __iter__
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     raise result
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound_Remote: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver Traceback (most recent call last):
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 407, in catch_client_exception
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return func(*args, **kwargs)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/conductor/manager.py", line 107, in instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     self.db.instance_get_by_uuid(context, instance_uuid))
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/db/api.py", line 598, in instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return IMPL.instance_get_by_uuid(context, uuid)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/openstack/common/trace/__init__.py", line 258, in wrapper
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     r = fn(*args, **kwargs)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return f(*args, **kwargs)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1531, in instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return _instance_get_by_uuid(context, uuid)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 115, in wrapper
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     return f(*args, **kwargs)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1541, in _instance_get_by_uuid
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver     raise exception.InstanceNotFound(instance_id=uuid)
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver InstanceNotFound: Instance 9b6045c5-4d63-47fd-8df1-5c424c72d5be could not be found.
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.735 3270 TRACE nova.virt.driver
  2013-05-15 11:03:05.742 3270 DEBUG nova.virt.driver [-] Emitting event <nova.virt.event.LifecycleEvent object at 0x4029c50> emit_event /opt/stack/nova/nova/virt/driver.py:901

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1180501/+subscriptions