yahoo-eng-team mailing list archive: Message #93554
[Bug 2054502] [NEW] Shutting down RabbitMQ causes nova-compute.service to go down
Public bug reported:
Description
===========
We run an OpenStack deployment with a 3-node RabbitMQ cluster and dozens of nova-compute nodes.
When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that nova-compute.service was down on 2 of the nova-compute nodes.
Upon checking, we found that nova-compute.service was in fact still running:
nova-compute.service - OpenStack Compute
Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
Main PID: 10130 (nova-compute)
Tasks: 32 (limit: 463517)
Memory: 248.2M
CPU: 55min 5.217s
CGroup: /system.slice/nova-compute.service
├─10130 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
├─11527 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
└─11702 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock
Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64060)
Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed for user root
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]: timer()
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]: cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]: waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread
I guess it's possible that, when a RabbitMQ node is shut down, nova-compute hits contention or state inconsistencies while processing connection recovery.
Restarting nova-compute.service resolves the problem.
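The "cannot switch to a different thread" error suggests that a greenlet is being resumed from an OS thread other than the one that created it. A minimal sketch of that failure mode (hypothetical, not the actual nova/oslo.messaging code path) is:

    import threading
    import greenlet

    def parked():
        # Immediately switch back to the parent; this greenlet then
        # stays parked until someone switches to it again.
        greenlet.getcurrent().parent.switch()

    g = greenlet.greenlet(parked)  # bound to the creating (main) OS thread
    g.switch()                     # start it; it parks itself, control returns

    def resume_from_other_thread():
        try:
            g.switch()             # greenlets cannot be resumed across OS threads
        except greenlet.error as exc:
            print(exc)             # prints: cannot switch to a different thread

    t = threading.Thread(target=resume_from_other_thread)
    t.start()
    t.join()

If connection-recovery or heartbeat code ever touches eventlet state from a native thread in this way, a hub timer firing waiter.switch() would raise exactly the traceback above.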
Logs & Configs
==============
The nova-compute.log:
2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 10.10.10.52:5672 via [amqp] client with port 35346.
2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Then systemctl status nova-compute shows:
Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 20 04:55:31 node002 nova-compute[10130]: timer()
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 20 04:55:31 node002 nova-compute[10130]: cb(*args, **kw)
Feb 20 04:55:31 node002 nova-compute[10130]: File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 20 04:55:31 node002 nova-compute[10130]: waiter.switch()
Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread
Environment: Ubuntu 22.04 (Jammy) + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)
nova.conf:
[oslo_messaging_rabbit]
[oslo_messaging_notifications]
driver = messagingv2
transport_url = *********
[notifications]
notification_format = unversioned
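Our [oslo_messaging_rabbit] section is empty, so oslo.messaging defaults apply. For anyone reproducing this, the options most relevant to failover and heartbeat behavior look like the following sketch (the option names are real oslo.messaging options; the values are illustrative, not our settings):

    [oslo_messaging_rabbit]
    # Running the AMQP heartbeat in a native OS thread while the rest of
    # the service runs under eventlet has been associated with this kind
    # of cross-thread greenlet error; explicitly disabling it is a common
    # mitigation (verify the default for your release).
    heartbeat_in_pthread = false
    # Heartbeat failure detection and reconnect pacing.
    heartbeat_timeout_threshold = 60
    kombu_reconnect_delay = 1.0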
** Affects: nova
Importance: Undecided
Status: New
** Tags: sts
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054502