[Bug 2111617] [NEW] Nova conductor failed to put entry in db during the build
Public bug reported:
Description
===========
OpenStack conductor threads hang while building an instance. The instance gets stuck in the building state: the placement allocation exists in the DB, but the conductor never writes the corresponding entry to the nova_cell1 instances table. Once the issue happens, we must delete and recreate the stack.
Error in Conductor
Traceback (most recent call last):
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
timer()
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
waiter.switch()
greenlet.error: cannot switch to a different thread
Steps to reproduce
==================
Shut down RabbitMQ for 5 minutes to emulate a failover scenario that forces the conductor threads to reconnect after the failure.
Once RabbitMQ is back online, wait a minute, then spin up the stack across multiple hypervisors.
Expected result
===============
All VMs are up and running, and all volumes are attached to their VMs.
Actual result
=============
Randomly, on different compute hosts, a VM gets stuck in the build/scheduled state, with the compute logs showing the errors below.
Compute Logs
2025-05-22 17:14:55.703 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:15:55.710 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:16:56.608 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:17:58.701 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:19:00.692 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
2025-05-22 17:20:01.661 3654615 INFO nova.compute.resource_tracker [None req-6c13fff6-0035-40ff-bc78-bc2c671b8f1a - - - - - -] Instance bbb2b3f7-6c9e-40c3-b4fd-839a47ca9c44 has allocations against this compute host but is not found in the database.
During the failure, the conductor hit the greenlet error below and failed to write the instance entry to the Nova cell database.
2025-05-22 18:41:53.101 7 INFO oslo.messaging._drivers.impl_rabbit [-] [4924e790-3518-4fae-8856-1c4336d4ee72] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37784.
2025-05-22 18:41:53.613 12 INFO oslo.messaging._drivers.impl_rabbit [-] [19734c27-f068-4514-8ddb-78e0dbaeb0db] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37792.
2025-05-22 18:41:53.889 12 INFO oslo.messaging._drivers.impl_rabbit [-] [2bae4372-d3a9-4acd-b575-00a27c8ca11a] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37798.
2025-05-22 18:41:54.525 8 INFO oslo.messaging._drivers.impl_rabbit [-] [7073f50d-61f7-4d51-ace6-5d5f1f0d0ab7] Reconnected to AMQP server on openstack-rabbitmq.openstack.svc.cluster.local:5672 via [amqp] client with port 37800.
Traceback (most recent call last):
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 471, in fire_timers
timer()
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/hubs/timer.py", line 59, in __call__
cb(*args, **kw)
File "/var/lib/openstack/lib/python3.10/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
waiter.switch()
greenlet.error: cannot switch to a different thread
Environment
===========
1. Exact version of OpenStack you are running:
OpenStack Caracal (2024.1)
nova@nova-conductor-7c8949bfd-5pmfr:/$ nova-conductor --version
29.2.1
2. Which hypervisor did you use? What's the version of that?
Libvirt 8.0.0 + KVM, kernel 5.15.0-136-generic
3. Which storage type did you use? What's the version of that?
Ceph Squid
4. Which networking type did you use?
Neutron with Open vSwitch in DPDK mode, plus the SR-IOV agent
What we have found so far:
OpenStack uses Eventlet for greenlet-based threading. Eventlet is only minimally maintained (it was originally written for Python 2.x), so recent OpenStack releases rely on monkey patching to make the Eventlet code paths work under Python 3.x. This leads to many threading issues. Some of the OpenStack fixes meant to prevent this (listed below) are already in Antelope, but we still hit this issue; a minimal sketch of the underlying greenlet error follows the links.
https://opendev.org/openstack/oslo.log/commit/94b9dc32ec1f52a582adbd97fe2847f7c87d6c17
https://opendev.org/openstack/oslo.log/commit/de615d9370681a2834cebe88acfa81b919da340c
https://review.opendev.org/c/openstack/oslo.log/+/914190
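For context, the "cannot switch to a different thread" error in the tracebacks above is what greenlet raises when code tries to switch to a greenlet that is bound to a different native thread. A minimal sketch (plain greenlet, not Nova or oslo code) that reproduces the same exception:

    import threading
    import greenlet

    # A greenlet is bound to the native thread that creates it
    # (here, the main thread).
    child = greenlet.greenlet(lambda: None)

    def switch_from_other_thread():
        try:
            child.switch()  # switching from a different native thread is not allowed
        except greenlet.error as exc:
            print(exc)  # prints: cannot switch to a different thread

    t = threading.Thread(target=switch_from_other_thread)
    t.start()
    t.join()

Our working assumption is that something similar happens inside the conductor when an eventlet semaphore waiter or timer created under one native thread is woken from another, for example around the RabbitMQ heartbeat/reconnect handling.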
However, a significant effort is underway to move OpenStack off its Eventlet dependency and onto asyncio. That effort will take at least four more releases, so a fix along those lines has not been released yet. Developers can follow the issue links below to track the changes.
https://github.com/eventlet/eventlet/issues/432
https://github.com/eventlet/eventlet/issues/662
We also note that https://github.com/eventlet/eventlet/issues/662 explicitly states that Eventlet 0.29.0 did not have this issue. We have not verified that statement; we had planned to roll back to that Eventlet version in our next release to confirm it, but we cannot safely downgrade Eventlet because of dependency constraints. We also changed the heartbeat_in_pthread setting for the Nova API in 2.12 and need to evaluate whether that setting is still required (an example of the setting is shown below).
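For reference, the setting in question is an oslo.messaging option set in nova.conf under [oslo_messaging_rabbit]; the value below only illustrates how it can be set and is not a recommendation:

    [oslo_messaging_rabbit]
    # Run the AMQP heartbeat in a native thread instead of a green thread
    # (mainly intended for API services running under uWSGI/mod_wsgi).
    heartbeat_in_pthread = true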
As a data point, we also had the same issue with Antelope.
A workaround is simply to retry creating the stack, but since this failure is causing many problems in our failover-readiness testing, we would like the community's help in fixing it.
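As a side note, the orphaned placement allocations that the resource tracker keeps reporting can be listed, and optionally removed, with nova-manage; this is only a cleanup aid, not a fix for the underlying race:

    nova-manage placement audit --verbose            # list allocations with no matching instance or migration
    nova-manage placement audit --verbose --delete   # remove them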
** Affects: nova
Importance: Undecided
Status: New
** Tags: conductor
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2111617
Title:
Nova conductor failed to put entry in db during the build
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2111617/+subscriptions