[Bug 1781421] [NEW] CantStartEngineError due to host aggregate up-call when boot from volume and [cinder]/cross_az_attach=False
Public bug reported:
This is semi-related to bug 1497253 but I found it while triaging that
bug to see if it was still an issue since Pike (I don't think it is).
If you run devstack with default superconductor mode configuration, and
configure nova-cpu.conf with:
[cinder]
cross_az_attach=False
Then, when you try to boot from volume such that nova-compute creates the
volume, it fails with CantStartEngineError because the cell conductor
(n-cond-cell1.service) is not configured to reach the API DB to get host
aggregate information.
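For context: in the superconductor layout the cell conductor is pointed
only at its cell database, not the API database. A minimal sketch of what
that means for n-cond-cell1's config (connection strings are illustrative,
not copied from devstack):
# Illustrative cell-conductor config: the cell DB is reachable...
[database]
connection = mysql+pymysql://root:secret@127.0.0.1/nova_cell1?charset=utf8
# ...but [api_database]/connection is deliberately unset, so any object
# load that needs the API DB (such as host aggregates) fails with
# CantStartEngineError.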
Here is a nova boot command to recreate:
$ nova boot --flavor cirros256 \
    --block-device id=e642acfd-4283-458a-b7ea-6c316da3b2ce,source=image,dest=volume,shutdown=remove,size=1,bootindex=0 \
    --poll test-bfv
Where the block device id is the uuid of the cirros image in the
devstack env.
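To look up that UUID in your own environment, the standard client listing
works (the cirros image name varies by devstack version):
$ openstack image list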
This is the failure in the nova-compute logs:
http://paste.openstack.org/show/725723/
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Instance failed block device setup: RemoteError: Remote error: CantStartEngineError
Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 1564, in _prep_block_device
    wait_func=self._await_block_device_map_created)
  File "/opt/stack/nova/nova/virt/block_device.py", line 854, in attach_block_devices
    _log_and_attach(device)
  File "/opt/stack/nova/nova/virt/block_device.py", line 851, in _log_and_attach
    bdm.attach(*attach_args, **attach_kwargs)
  File "/opt/stack/nova/nova/virt/block_device.py", line 747, in attach
    context, instance, volume_api, virt_driver)
  File "/opt/stack/nova/nova/virt/block_device.py", line 46, in wrapped
    ret_val = method(obj, context, *args, **kwargs)
  File "/opt/stack/nova/nova/virt/block_device.py", line 623, in attach
    instance=instance)
  File "/opt/stack/nova/nova/volume/cinder.py", line 504, in check_availability_zone
    instance_az = az.get_instance_availability_zone(context, instance)
  File "/opt/stack/nova/nova/availability_zones.py", line 194, in get_instance_availability_zone
    az = get_host_availability_zone(elevated, host)
  File "/opt/stack/nova/nova/availability_zones.py", line 95, in get_host_availability_zone
    key='availability_zone')
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
    args, kwargs)
  File "/opt/stack/nova/nova/conductor/rpcapi.py", line 241, in object_class_action_versions
    args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
    call_monitor_timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
    raise result
RemoteError: Remote error: CantStartEngineError No sql_connection parameter is established
[u'Traceback (most recent call last):\n', u'  File "/opt/stack/nova/nova/conductor/manager.py", line 124, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', u'  File "/usr/local/lib/python2.7...
The logging at the start is my own for debug:
[instance: 910509b9-e23a-4b40-bb42-0df7b65bb36e] Getting AZ for instance; instance.host: rocky; instance.availabilty_zone: nova
But it shows that instance.host and instance.availability_zone are both
set. instance.host gets set by the instance_claim in the resource tracker,
and instance.availability_zone gets set by conductor at the top of the
schedule_and_build_instances method due to this change in Pike:
https://review.openstack.org/#/c/446053/
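To make the control flow concrete, here is a standalone sketch of the
pre-patch behavior with stand-in stubs (illustrative Python, not nova's
actual code): once instance.host is set, the AZ lookup always goes through
host aggregates, which from a cell conductor means an up-call to the API DB.
class CantStartEngineError(Exception):
    """Stand-in for oslo.db's error when no sql_connection is configured."""

def get_host_availability_zone(host):
    # In real nova this calls objects.AggregateList.get_by_host(), and
    # aggregates live in the API database -- unreachable from n-cond-cell1.
    raise CantStartEngineError('No sql_connection parameter is established')

def get_instance_availability_zone(instance):
    host = instance.get('host')
    if not host:
        # Only the no-host case avoids the aggregate lookup.
        return instance.get('availability_zone')
    # host is set (instance_claim already ran), so we up-call and fail,
    # even though conductor already stamped availability_zone.
    return get_host_availability_zone(host)

instance = {'host': 'rocky', 'availability_zone': 'nova'}
try:
    get_instance_availability_zone(instance)
except CantStartEngineError as exc:
    print('fails despite the AZ being known: %s' % exc)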
So all I have to do to avoid the up-call is this:
diff --git a/nova/availability_zones.py b/nova/availability_zones.py
index 7c8d948..f128d8e 100644
--- a/nova/availability_zones.py
+++ b/nova/availability_zones.py
@@ -165,7 +165,7 @@ def get_availability_zones(context, get_only_available=False,
 def get_instance_availability_zone(context, instance):
     """Return availability zone of specified instance."""
     host = instance.host if 'host' in instance else None
-    if not host:
+    if not host or (host and instance.availability_zone):
         # Likely hasn't reached a viable compute node yet so give back the
         # desired availability_zone in the instance record if the boot request
         # specified one.
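Applying the same change to the sketch above shows the effect (again
illustrative, not nova's code): a populated instance.availability_zone now
short-circuits the aggregate lookup entirely.
def get_instance_availability_zone_patched(instance, lookup_host_az=None):
    host = instance.get('host')
    if not host or (host and instance.get('availability_zone')):
        # No host yet, or conductor already set the AZ at schedule time
        # (Pike+): trust the instance record and skip the up-call.
        return instance.get('availability_zone')
    # Only records with a host but no stamped AZ still need the lookup.
    return lookup_host_az(host)

instance = {'host': 'rocky', 'availability_zone': 'nova'}
assert get_instance_availability_zone_patched(instance) == 'nova'
(Records that predate the Pike change, with a host set but no stamped AZ,
would still take the aggregate-lookup path.)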
This would also fix #5 in our up-call list:
https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls
** Affects: nova
     Importance: Medium
         Status: Triaged

** Affects: nova/pike
     Importance: Medium
         Status: Triaged

** Affects: nova/queens
     Importance: Medium
         Status: Triaged

** Tags: cells cinder compute upcall

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Also affects: nova/pike
   Importance: Undecided
       Status: New

** Changed in: nova/pike
       Status: New => Triaged

** Changed in: nova/queens
       Status: New => Triaged

** Changed in: nova/queens
   Importance: Undecided => Medium

** Changed in: nova/pike
   Importance: Undecided => Medium
https://bugs.launchpad.net/bugs/1781421