[Bug 2111617] [NEW] Nova conductor failed to put entry in db during the build
Public bug reported:
Description
===========
OpenStack conductor threads hang while building an instance. The instance gets stuck in the building state: the placement allocation exists in the DB, but the conductor never writes the corresponding entry to the nova_cell1 instances table. Once the issue happens, we must delete and recreate the stack.
Error in Conductor
Traceback (most recent call last):
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
timer()
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
waiter.switch()
greenlet.error: cannot switch to a different thread
Steps to reproduce
==================
Shut down RabbitMQ for 5 minutes to emulate a failover scenario that forces the conductor threads to reconnect after the failure.
Once RabbitMQ is back online, wait a minute, then spin up the stack across multiple hypervisors.
Expected result
===============
All VMs are up and running, and all volumes are attached to their VMs.
Actual result
=============
Randomly, on different compute hosts, a VM gets stuck in the build/scheduled state, with the compute logs showing the errors below.
Compute Logs
2025-05-22 17:14:55.703 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:15:55.710 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:16:56.608 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:17:58.701 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:19:00.692 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:20:01.661 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
During the failure, the conductor hit the greenlet error below and failed to write the instance entry to the Nova cell database.
2025-05-22 18:41:53.101 7 INFO oslo.messaging._drivers.impl_rabbit [-] [4924e790-3518-4fae-8856-1c4336d4ee72] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37784.
2025-05-22 18:41:53.613 12 INFO oslo.messaging._drivers.impl_rabbit [-] [19734c27-f068-4514-8ddb-78e0dbaeb0db] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37792.
2025-05-22 18:41:53.889 12 INFO oslo.messaging._drivers.impl_rabbit [-] [2bae4372-d3a9-4acd-b575-00a27c8ca11a] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37798.
2025-05-22 18:41:54.525 8 INFO oslo.messaging._drivers.impl_rabbit [-] [7073f50d-61f7-4d51-ace6-5d5f1f0d0ab7] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37800.
Traceback (most recent call last):
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
timer()
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
waiter.switch()
greenlet.error: cannot switch to a different thread
Environment
===========
1. Exact version of OpenStack you are running:
OpenStack Caracal (2024.1)
nova@nova-conductor-7c8949bfd-5pmfr:/$ nova-conductor --version
29.2.1
2. Which hypervisor did you use? What's the version of that?
Libvirt 8.0.0 + KVM, kernel 5.15.0-136-generic
3. Which storage type did you use? What's the version of that?
Ceph Squid
4. Which networking type did you use?
Neutron with Open vSwitch in DPDK mode, plus the SR-IOV agent
What we have found so far:
OpenStack uses Eventlet for greenlet-based threading. Eventlet is only minimally maintained (it was originally written for Python 2.x), so recent OpenStack releases rely on monkey patching to make the Eventlet code paths work under Python 3.x. This leads to many threading issues. Some of the OpenStack fixes meant to prevent this (listed below) are already in Antelope, but we still hit this issue; a minimal sketch of the underlying greenlet error follows the links.
https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17
https://opendev.org/openstack/oslo.log/commit/de615d9370681a2834cebe88acfa81b919da340c
https://review.opendev.org/c/openstack/oslo.log/+/914190
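For context, the "cannot switch to a different thread" error in the tracebacks above is what greenlet raises when code tries to switch to a greenlet that is bound to a different native thread. A minimal sketch (plain greenlet, not Nova or oslo code) that reproduces the same exception:

    import threading
    import greenlet

    # A greenlet is bound to the native thread that creates it
    # (here, the main thread).
    child = greenlet.greenlet(lambda: None)

    def switch_from_other_thread():
        try:
            child.switch()  # switching from a different native thread is not allowed
        except greenlet.error as exc:
            print(exc)  # prints: cannot switch to a different thread

    t = threading.Thread(target=switch_from_other_thread)
    t.start()
    t.join()

Our working assumption is that something similar happens inside the conductor when an eventlet semaphore waiter or timer created under one native thread is woken from another, for example around the RabbitMQ heartbeat/reconnect handling.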
However, a significant effort is underway to move OpenStack off its Eventlet dependency and onto asyncio. That effort will take at least four more releases, so a fix along those lines has not been released yet. Developers can follow the issue links below to track the changes.
https://github.com/eventlet/eventlet/issues/432
https://github.com/eventlet/eventlet/issues/662
We also note that https://github.com/eventlet/eventlet/issues/662 explicitly states that Eventlet 0.29.0 did not have this issue. We have not verified that statement; we had planned to roll back to that Eventlet version in our next release to confirm it, but we cannot safely downgrade Eventlet because of dependency constraints. We also changed the heartbeat_in_pthread setting for the Nova API in 2.12 and need to evaluate whether that setting is still required (an example of the setting is shown below).
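For reference, the setting in question is an oslo.messaging option set in nova.conf under [oslo_messaging_rabbit]; the value below only illustrates how it can be set and is not a recommendation:

    [oslo_messaging_rabbit]
    # Run the AMQP heartbeat in a native thread instead of a green thread
    # (mainly intended for API services running under uWSGI/mod_wsgi).
    heartbeat_in_pthread = true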
As a data point, we also had the same issue with Antelope.
A workaround is simply to retry creating the stack, but since this failure is causing many problems in our failover-readiness testing, we would like the community's help in fixing it.
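As a side note, the orphaned placement allocations that the resource tracker keeps reporting can be listed, and optionally removed, with nova-manage; this is only a cleanup aid, not a fix for the underlying race:

    nova-manage placement audit --verbose            # list allocations with no matching instance or migration
    nova-manage placement audit --verbose --delete   # remove them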
** Affects: nova
Importance: Undecided
Status: New
** Tags: conductor
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2111617
Title:
Nova conductor failed to put entry in db during the build
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2111617/+subscriptions