
yahoo-eng-team team mailing list archive

[Bug 2054502] Re: Shutting down RabbitMQ causes nova-compute.service to go down

 

This isn't a Nova bug; it may be an oslo.messaging problem. In any case,
since the nova-compute service will be reported down, the servicegroup
API won't consider it available to the scheduler, so this shouldn't
cause a problem.
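For context, nova's default (DB) servicegroup driver considers a compute service "up" only when its last reported heartbeat is within [DEFAULT] service_down_time (60 seconds by default), which is why the scheduler stops picking the affected hosts. A minimal sketch of that check (names here are illustrative, not nova's actual code):

```python
import datetime

# Illustrative sketch of nova's DB servicegroup liveness check:
# a compute service counts as "up" only if its last heartbeat
# (the service record's updated_at) is recent enough.
SERVICE_DOWN_TIME = 60  # nova [DEFAULT] service_down_time default, in seconds

def service_is_up(last_heartbeat, now):
    """Return True if the heartbeat falls within the allowed window."""
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= SERVICE_DOWN_TIME

now = datetime.datetime(2024, 2, 20, 4, 56, 0)
print(service_is_up(now - datetime.timedelta(seconds=30), now))   # True
print(service_is_up(now - datetime.timedelta(seconds=300), now))  # False
```

Once the stuck nova-compute stops heartbeating, this check flips to False and the host drops out of scheduling.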


** Also affects: oslo.messaging
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054502

Title:
  Shutting down RabbitMQ causes nova-compute.service to go down

Status in OpenStack Compute (nova):
  Invalid
Status in oslo.messaging:
  New

Bug description:
  Description
  ===========
  We have an OpenStack deployment with a 3-node RabbitMQ cluster and dozens of nova-compute nodes.
  When we shut down 1 of the 3 RabbitMQ nodes, Nagios alerted that nova-compute.service was down on 2 of the nova-compute nodes.

  Upon checking, we found that nova-compute.service was still running:

  nova-compute.service - OpenStack Compute
       Loaded: loaded (/lib/systemd/system/nova-compute.service; enabled; vendor preset: enabled)
       Active: active (running) since Fri 2024-02-16 00:42:47 UTC; 4 days ago
     Main PID: 10130 (nova-compute)
        Tasks: 32 (limit: 463517)
       Memory: 248.2M
          CPU: 55min 5.217s
       CGroup: /system.slice/nova-compute.service
               ├─10130 /usr/bin/python3 /usr/bin/nova-compute --config-file=/etc/nova/nova.conf --config-file=/etc/nova/nova-compute.conf --log-file=/var/log/nova/nova-compute.log
               ├─11527 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context vif_plug_ovs.privsep.vif_plug --privsep_sock_path /tmp/tmpc0sosqey/privsep.sock
               └─11702 /usr/bin/python3 /bin/privsep-helper --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf --privsep_context nova.privsep.sys_admin_pctxt --privsep_sock_path /tmp/tmp2ik7rchu/privsep.sock

  Feb 16 00:42:53 node002 sudo[11540]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=64060)
  Feb 16 00:42:54 node002 sudo[11540]: pam_unix(sudo:session): session closed for user root
  Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
  Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
  Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
  Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
  Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread

  I suspect that when a RabbitMQ node is shut down, nova-compute hits contention or inconsistent state while handling connection recovery.
  Restarting nova-compute.service resolves the problem.

  Logs & Configs
  ==============
  The nova-compute.log:

  2024-02-20 04:55:28.675 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
  2024-02-20 04:55:29.677 10130 ERROR oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] AMQP server on 10.10.10.59:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
  2024-02-20 04:55:30.682 10130 INFO oslo.messaging._drivers.impl_rabbit [-] [0aefd459-297a-48e8-8b15-15c763531431] Reconnected to AMQP server on 10.10.10.52:5672 via [amqp] client with port 35346.
  2024-02-20 04:55:31.361 10130 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
  Then systemctl status nova-compute showed:
  Feb 20 04:55:31 node002 nova-compute[10130]: Traceback (most recent call last):
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
  Feb 20 04:55:31 node002 nova-compute[10130]:     timer()
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
  Feb 20 04:55:31 node002 nova-compute[10130]:     cb(*args, **kw)
  Feb 20 04:55:31 node002 nova-compute[10130]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
  Feb 20 04:55:31 node002 nova-compute[10130]:     waiter.switch()
  Feb 20 04:55:31 node002 nova-compute[10130]: greenlet.error: cannot switch to a different thread
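  The greenlet.error above is the generic error greenlet raises whenever code tries to switch to a greenlet owned by a different OS thread, which suggests the reconnect path ran on the wrong thread. A minimal standalone reproduction, independent of nova (requires the third-party greenlet package that eventlet depends on):

```python
import threading

import greenlet  # third-party dependency of eventlet (pip install greenlet)

def worker():
    # Yield back to the parent immediately, leaving this greenlet
    # started but unfinished, bound to the main thread.
    greenlet.getcurrent().parent.switch()

main_thread_greenlet = greenlet.greenlet(worker)
main_thread_greenlet.switch()  # start it; it switches back to us

captured = []

def switch_from_foreign_thread():
    # Switching to a greenlet owned by another thread is illegal and
    # raises the same error seen in the traceback above.
    try:
        main_thread_greenlet.switch()
    except greenlet.error as exc:
        captured.append(str(exc))

t = threading.Thread(target=switch_from_foreign_thread)
t.start()
t.join()
print(captured[0])  # cannot switch to a different thread
```

  This only demonstrates the error class; which nova/oslo.messaging code path performs the cross-thread switch during RabbitMQ failover is the open question here.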


  Environment: Ubuntu Jammy + nova-compute (3:25.2.0-0ubuntu1) + rabbitmq-server (3.9)

  nova.conf:

  [oslo_messaging_rabbit]

  
  [oslo_messaging_notifications]
  driver = messagingv2
  transport_url = *********

  [notifications]
  notification_format = unversioned
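
  Not a fix for the greenlet error itself, but since [oslo_messaging_rabbit] above is empty (all defaults): oslo.messaging exposes reconnect and heartbeat tuning there. A sketch with the upstream default values, shown for reference (verify option names and defaults against your oslo.messaging release):

```ini
[oslo_messaging_rabbit]
# Seconds to wait before reconnecting after a connection failure.
kombu_reconnect_delay = 1.0
# Initial retry interval and backoff for AMQP connection attempts.
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_interval_max = 30
# AMQP heartbeat: consider the connection dead after this many seconds,
# checked heartbeat_rate times per timeout window.
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
```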

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054502/+subscriptions


