[Bug 1781421] [NEW] CantStartEngineError due to host aggregate up-call when boot from volume and [cinder]/cross_az_attach=False
Public bug reported:
This is semi-related to bug 1497253 but I found it while triaging that
bug to see if it was still an issue since Pike (I don't think it is).
If you run devstack with default superconductor mode configuration, and
configure nova-cpu.conf with:
[cinder]
cross_az_attach=False
Then, when you try to boot from volume such that nova-compute creates the
volume, it fails with CantStartEngineError because the cell conductor
(n-cond-cell1.service) is not configured to reach the API DB to get host
aggregate information.
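For context: in the superconductor layout the cell conductor is pointed
only at its cell database, not the API database. A minimal sketch of what
that means for n-cond-cell1's config (connection strings are illustrative,
not copied from devstack):
# Illustrative cell-conductor config: the cell DB is reachable...
[database]
connection = mysql+pymysql://root:secret@127.0.0.1/nova_cell1?charset=utf8
# ...but [api_database]/connection is deliberately unset, so any object
# load that needs the API DB (such as host aggregates) fails with
# CantStartEngineError.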
Here is a nova boot command to recreate:
$ nova boot --flavor cirros256 \
    --block-device id=e642acfd-4283-458a-b7ea-6c316da3b2ce,source=image,dest=volume,shutdown=remove,size=1,bootindex=0 \
    --poll test-bfv
Where the block device id is the uuid of the cirros image in the
devstack env.
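To look up that UUID in your own environment, the standard client listing
works (the cirros image name varies by devstack version):
$ openstack image list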
This is the failure in the nova-compute logs:
http://paste.openstack.org/show/725723/
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Instance failed block device setup: RemoteError: Remote error: CantStartEngineError
Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1564, in _prep_block_device
    wait_func=self._await_block_device_map_created)
  File "/opt/stack/nova/nova/virt/block_device.py", line 854, in attach_block_devices
    _log_and_attach(device)
  File "/opt/stack/nova/nova/virt/block_device.py", line 851, in _log_and_attach
    bdm.attach(*attach_args, **attach_kwargs)
  File "/opt/stack/nova/nova/virt/block_device.py", line 747, in attach
    context, instance, volume_api, virt_driver)
  File "/opt/stack/nova/nova/virt/block_device.py", line 46, in wrapped
    ret_val = method(obj, context, *args, **kwargs)
  File "/opt/stack/nova/nova/virt/block_device.py", line 623, in attach
    instance=instance)
  File "/opt/stack/nova/nova/volume/cinder.py", line 504, in check_availability_zone
    instance_az = az.get_instance_availability_zone(context, instance)
  File "/opt/stack/nova/nova/availability_zones.py", line 194, in get_instance_availability_zone
    az = get_host_availability_zone(elevated, host)
  File "/opt/stack/nova/nova/availability_zones.py", line 95, in get_host_availability_zone
    key='availability_zone')
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
    args, kwargs)
  File "/opt/stack/nova/nova/conductor/rpcapi.py", line 241, in object_class_action_versions
    args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
    call_monitor_timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
    raise result
RemoteError: Remote error: CantStartEngineError No sql_connection parameter is established
[u'Traceback (most recent call last):\n', u'  File "/opt/stack/nova/nova/conductor/manager.py", line 124, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', u'  File "/usr/local/lib/python2.7...
The logging at the start is my own for debug:
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova
But it shows that instance.host and instance.availability_zone are both
set. instance.host gets set by the instance_claim in the resource tracker,
and instance.availability_zone gets set by conductor at the top of the
schedule_and_build_instances method due to this change in Pike:
https://review.openstack.org/#/c/446053/
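To make the control flow concrete, here is a standalone sketch of the
pre-patch behavior with stand-in stubs (illustrative Python, not nova's
actual code): once instance.host is set, the AZ lookup always goes through
host aggregates, which from a cell conductor means an up-call to the API DB.
class CantStartEngineError(Exception):
    """Stand-in for oslo.db's error when no sql_connection is configured."""

def get_host_availability_zone(host):
    # In real nova this calls objects.AggregateList.get_by_host(), and
    # aggregates live in the API database -- unreachable from n-cond-cell1.
    raise CantStartEngineError('No sql_connection parameter is established')

def get_instance_availability_zone(instance):
    host = instance.get('host')
    if not host:
        # Only the no-host case avoids the aggregate lookup.
        return instance.get('availability_zone')
    # host is set (instance_claim already ran), so we up-call and fail,
    # even though conductor already stamped availability_zone.
    return get_host_availability_zone(host)

instance = {'host': 'rocky', 'availability_zone': 'nova'}
try:
    get_instance_availability_zone(instance)
except CantStartEngineError as exc:
    print('fails despite the AZ being known: %s' % exc)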
So all I have to do to avoid the up-call is this:
diff --git a/nova/availability_zones.py b/nova/availability_zones.py
index 7c8d948..f128d8e 100644
--- a/nova/availability_zones.py
+++ b/nova/availability_zones.py
@@ -165,7 +165,7 @@ def get_availability_zones(context, get_only_available=False,
 def get_instance_availability_zone(context, instance):
     """Return availability zone of specified instance."""
     host = instance.host if 'host' in instance else None
-    if not host:
+    if not host or (host and instance.availability_zone):
         # Likely hasn't reached a viable compute node yet so give back the
         # desired availability_zone in the instance record if the boot request
         # specified one.
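Applying the same change to the sketch above shows the effect (again
illustrative, not nova's code): a populated instance.availability_zone now
short-circuits the aggregate lookup entirely.
def get_instance_availability_zone_patched(instance, lookup_host_az=None):
    host = instance.get('host')
    if not host or (host and instance.get('availability_zone')):
        # No host yet, or conductor already set the AZ at schedule time
        # (Pike+): trust the instance record and skip the up-call.
        return instance.get('availability_zone')
    # Only records with a host but no stamped AZ still need the lookup.
    return lookup_host_az(host)

instance = {'host': 'rocky', 'availability_zone': 'nova'}
assert get_instance_availability_zone_patched(instance) == 'nova'
(Records that predate the Pike change, with a host set but no stamped AZ,
would still take the aggregate-lookup path.)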
This would also fix #5 in our up-call list:
https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls
** Affects: nova
     Importance: Medium
         Status: Triaged

** Affects: nova/pike
     Importance: Medium
         Status: Triaged

** Affects: nova/queens
     Importance: Medium
         Status: Triaged

** Tags: cells cinder compute upcall

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Also affects: nova/pike
   Importance: Undecided
       Status: New

** Changed in: nova/pike
       Status: New => Triaged

** Changed in: nova/queens
       Status: New => Triaged

** Changed in: nova/queens
   Importance: Undecided => Medium

** Changed in: nova/pike
   Importance: Undecided => Medium
https://bugs.launchpad.net/bugs/1781421