
yahoo-eng-team team mailing list archive

[Bug 2054502] [NEW] shutting down rabbitmq causes nova-compute.service down

 

Public bug reported:

Description
===========
We run an OpenStack deployment with a 3-node RabbitMQ cluster and dozens of nova-compute nodes.
When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that nova-compute.service was down on 2 of the compute nodes.

Upon checking, we found that nova-compute.service was in fact still running:

nova-compute.service - OpenStack Compute
     Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
   Main PID: 10130 (nova-compute)
      Tasks: 32 (limit: 463517)
     Memory: 248.2M
        CPU: 55min 5.217s
     CGroup: /system.slice/nova-compute.service
             ├─10130 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
             ├─11527 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
             └─11702 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock

Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64060)
Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed for user root
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

I suspect that when a RabbitMQ node is shut down, nova-compute runs into contention or an inconsistent internal state while processing connection recovery.
Restarting nova-compute.service resolves the problem.
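For context, the "cannot switch to a different thread" error in the traceback is the generic failure greenlet raises when code tries to switch to a greenlet from a different OS thread than the one that created it. A minimal standalone reproduction (hypothetical, completely outside nova/oslo.messaging) of that error:

```python
# A greenlet is bound to the OS thread that created it; attempting to
# switch to it from another thread raises greenlet.error.
import threading
import greenlet

child = greenlet.greenlet(lambda: None)  # created in, and bound to, the main thread
errors = []

def run_in_other_thread():
    try:
        child.switch()  # wrong thread: greenlets cannot migrate between threads
    except greenlet.error as exc:
        errors.append(str(exc))

t = threading.Thread(target=run_in_other_thread)
t.start()
t.join()
print(errors[0])  # the same "cannot switch to a different thread" message
```

This does not show *why* nova-compute's recovery path ends up switching from the wrong thread, only what condition produces the exact error message in the journal.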

Logs & Configs
==============
The nova-compute.log:

2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 10.10.10.52:5672 via [amqp] client with port 35346.
2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Then systemctl status nova-compute shows:
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread


Environment: Ubuntu 22.04 (Jammy) + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)

nova.conf:

[oslo_messaging_rabbit]


[oslo_messaging_notifications]
driver = messagingv2
transport_url = *********

[notifications]
notification_format = unversioned
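For anyone triaging: the empty [oslo_messaging_rabbit] section above means the driver runs with its defaults. A few options in that section are commonly examined for this class of reconnect problem; the fragment below is illustrative only (the heartbeat and reconnect values shown are the documented oslo.messaging defaults) and is not a confirmed fix for this bug:

```ini
[oslo_messaging_rabbit]
# Run the AMQP heartbeat in a native thread instead of a green thread;
# this option exists precisely because of eventlet/greenlet interaction
# issues around heartbeats and reconnects.
heartbeat_in_pthread = true
# Heartbeat tuning (defaults, shown explicitly).
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
# Back-off behaviour when reconnecting to another cluster member (defaults).
kombu_reconnect_delay = 1.0
rabbit_interval_max = 30
```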

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: sts

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054502
