
yahoo-eng-team team mailing list archive

[Bug 2062009] [NEW] Neutron-server + uwsgi deadlocks when running rpc workers

 

Public bug reported:

In certain situations we observe that neutron-server + uwsgi shares
locks between its native threads and its eventlet threads. Because
eventlet relies on being notified when a lock is released, this can
lead to a deadlock: the eventlet thread waits indefinitely for a lock
that has already been released. In our infrastructure this results in
API requests that are completed on the Neutron side while the caller
never receives a response. For actions like port creation from e.g.
Nova or Manila this leads to orphaned ports, because the caller simply
retries the port creation.
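
For illustration, here is a minimal, contrived sketch of the kind of
interaction meant above (all names are made up for the example, and
whether it actually reproduces depends on the eventlet version and on
timing):

import eventlet
eventlet.monkey_patch()

import threading  # patched: threading.Lock is now an eventlet green lock

import eventlet.patcher
real_threading = eventlet.patcher.original('threading')
real_time = eventlet.patcher.original('time')

lock = threading.Lock()  # green lock, shared with a native thread below


def native_worker():
    lock.acquire()
    real_time.sleep(1)  # hold the lock for a moment
    # The release happens in a native OS thread. The wake-up for any
    # green waiter is scheduled on that thread's hub, which never runs,
    # so the waiter may never be notified.
    lock.release()


def green_worker():
    with lock:  # waits for the native thread to release the lock
        print("greenthread got the lock")


t = real_threading.Thread(target=native_worker)
t.start()
eventlet.sleep(0.2)  # let the native thread grab the lock first

gt = eventlet.spawn(green_worker)
try:
    with eventlet.Timeout(5):
        gt.wait()
    print("no deadlock this time (timing dependent)")
except eventlet.Timeout:
    print("deadlocked: the greenthread never saw the release")
t.join()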

To better debug this we have reintroduced guru meditation reports into
neutron-server[0] and configured uwsgi to send a SIGWINCH on a
harakiri[1] to trigger the guru meditation whenever a uwsgi worker
deadlocks.
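
For reference, the autorun hook is roughly of this shape (a sketch of
the usual oslo.reports wiring, not the actual patch in [0]; the
SIGWINCH choice matches the uwsgi config in [1]):

import signal

from oslo_config import cfg
from oslo_reports import guru_meditation_report as gmr
from oslo_reports import opts as gmr_opts

from neutron import version


def setup_guru_meditation(conf=cfg.CONF):
    # Register the oslo.reports options (log_dir etc.) on the config object.
    gmr_opts.set_defaults(conf)
    # Dump a report on SIGWINCH (instead of the default SIGUSR2), which is
    # the signal uwsgi is configured to send on a harakiri timeout.
    gmr.TextGuruMeditation.setup_autorun(version,
                                         service_name='neutron-server',
                                         signum=signal.SIGWINCH,
                                         conf=conf)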

The two most interesting candidates seem to be a shared lock inside
oslo_messaging and Python's logging lock, which is also taken from
within oslo_messaging. Both cases identified by the traceback seem to
point to oslo_messaging and its RPC server (see the attached guru
meditation report).

As all RPC servers should run inside neutron-rpc-server anyway (due to
the uwsgi/neutron-rpc-server split), we should move these instances
over there. This will also fix bug #1864418. One easy way to find such
instances is to check via the backdoor (or a manual manhole
installation, if the backdoor is not available) and search for
instances of oslo_messaging.server.MessageHandlingServer via fo(). In
our setup (given the service_plugins we have enabled) we see RPC
servers running from the trunk and logapi plugins:

>>> [ep for mhs in fo(oslo_messaging.server.MessageHandlingServer) for ep in mhs.dispatcher.endpoints]
[<neutron.services.logapi.rpc.server.LoggingApiSkeleton object at 0x7fb0d465ec10>, <neutron.services.trunk.rpc.server.TrunkSkeleton object at 0x7f622ec11cd0>]
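
Where the backdoor (and its fo() helper) is not available, a manually
installed manhole gives the same view; rough sketch (the manhole socket
path and the plain-gc search are just one way to do it):

# inside the neutron-server process (e.g. patched into startup code):
import manhole
manhole.install()  # opens a unix socket such as /tmp/manhole-<pid>

# then, from the manhole shell:
import gc
import oslo_messaging

servers = [o for o in gc.get_objects()
           if isinstance(o, oslo_messaging.server.MessageHandlingServer)]
print([ep for s in servers for ep in s.dispatcher.endpoints])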

These RPC servers should instead be started via start_rpc_listeners().
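
Roughly, that means a service plugin exposes its endpoints like this
(illustrative names, using the standard neutron-lib RPC helpers; not
the actual patch):

from neutron_lib import rpc as n_rpc


class ExampleRpcEndpoint(object):
    # stand-in for e.g. TrunkSkeleton / LoggingApiSkeleton
    pass


class ExampleServicePlugin(object):

    def start_rpc_listeners(self):
        # Only the RPC workers call this, so the resulting
        # oslo_messaging MessageHandlingServer lives in
        # neutron-rpc-server, not in the uwsgi API workers.
        # (Assumes n_rpc.init() has already set up the transport.)
        self.conn = n_rpc.Connection()
        self.conn.create_consumer('example-topic', [ExampleRpcEndpoint()],
                                  fanout=False)
        return self.conn.consume_in_threads()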

Nova has had similar problems with eventlet and logging in the past,
see [2][3]. Tests were done with Neutron Yoga (or rather our own
stable/yoga-m3 branch), but the issue is also present in current
master.

[0] https://github.com/sapcc/neutron/commit/a7c44263b70089d8106bed6d8d5d0e3ddf44d5ad
[1] https://github.com/sapcc/helm-charts/blob/7a93e91c3af16ad2eb91e0a1d176d56a26faa393/openstack/neutron/templates/etc/_uwsgi.ini.tpl#L46-L50
[2] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/host.py#L197-L201
[3] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/migration.py#L406-L407

** Affects: neutron
     Importance: Undecided
         Status: New

** Attachment added: "guru-meditation-report.txt"
   https://bugs.launchpad.net/bugs/2062009/+attachment/5766806/+files/guru-meditation-report.txt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2062009
