yahoo-eng-team team mailing list archive
[Bug 2062009] Re: Neutron-server + uwsgi deadlocks when running rpc workers
Reviewed: https://review.opendev.org/c/openstack/neutron/+/916112
Committed: https://opendev.org/openstack/neutron/commit/ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Submitter: "Zuul (22348)"
Branch: master
commit ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Author: Sebastian Lohff <sebastian.lohff@xxxxxxx>
Date: Mon Apr 15 16:14:50 2024 +0200
Start trunk plugin RPC via service framework
Instead of each individual driver setting up the RPC server (and setting
the _rpc_backend attribute on the TrunkPlugin) we now check in the
TrunkPlugin if any driver requires the RPC backend to be started.
Additionally, we only start it when this is requested by Neutron via
start_rpc_listeners(). This is required when running neutron-server and
neutron-rpc-server separately to run RPC only in neutron-rpc-server.
As we still need the notifiers of ServerSideRpcBackend to be
created/started, we separate TrunkSkeleton (which is the RPC server
implementation) and ServerSideRpcBackend (which is essentially only a
notifier). In case RPC is required by a driver, we always start the
notifier, but the RPC server only when requested via
start_rpc_listeners().
Change-Id: I2c6362b3320e534a6e65bd7701b5ac2feca42a49
Closes-Bug: #2015275
Closes-Bug: #2062009
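A very rough, self-contained sketch of the control flow the commit message describes is below. Only the names TrunkPlugin, TrunkSkeleton, ServerSideRpcBackend and start_rpc_listeners() come from the change itself; the driver loading, the requires_rpc flag and the rpc_servers attribute are stand-ins for illustration only.

# Sketch only: the class bodies are stand-ins, not the real Neutron code.
class ServerSideRpcBackend:
    """Stand-in for the notifier side, which is always needed."""

class TrunkSkeleton:
    """Stand-in for the actual oslo.messaging RPC server implementation."""
    def __init__(self):
        self.rpc_servers = []  # would hold MessageHandlingServer objects

class FakeDriver:
    requires_rpc = True  # hypothetical flag; real drivers signal this differently

class TrunkPlugin:
    def __init__(self, drivers):
        self._drivers = drivers
        self._rpc_backend = None
        # The notifier is created whenever any loaded driver needs RPC ...
        if any(getattr(d, 'requires_rpc', False) for d in self._drivers):
            self._rpc_backend = ServerSideRpcBackend()

    def start_rpc_listeners(self):
        # ... but the RPC server itself is only started when Neutron asks
        # for it, i.e. in neutron-rpc-server rather than the uwsgi workers.
        if self._rpc_backend is None:
            return []
        self._skeleton = TrunkSkeleton()
        return self._skeleton.rpc_servers

plugin = TrunkPlugin([FakeDriver()])
print(plugin.start_rpc_listeners())  # empty in this sketch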
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2062009
Title:
Neutron-server + uwsgi deadlocks when running rpc workers
Status in neutron:
Fix Released
Bug description:
In certain situations we observe that neutron-server + uwsgi shares
locks between its native threads and its eventlet threads. As eventlet
relies on being informed when a lock is released, this may lead to a
deadlock: the eventlet thread waits indefinitely for a lock that has
already been released. In our infrastructure this results in API
requests being carried out on the Neutron side while the caller never
gets a response. For actions like port creation from e.g. Nova or
Manila this leads to orphaned ports, as the caller simply retries
creating the port.
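The sketch below is an illustrative model of that failure mode (not taken from Neutron): a lock that eventlet has monkey-patched is released from a real OS thread, so the hub serving the waiting greenthread is never notified. Whether it actually hangs depends on the eventlet version; treat it as a model of the problem, not a guaranteed reproducer.

import eventlet
eventlet.monkey_patch()

import threading  # noqa: E402 -- now green: Lock() is an eventlet lock
import time       # noqa: E402 -- now green as well

from eventlet import patcher  # noqa: E402

# Unpatched threading, standing in for the native worker threads that
# oslo.messaging and the logging module use internally.
real_threading = patcher.original('threading')

shared_lock = threading.Lock()
shared_lock.acquire()  # held, as if another code path currently owned it

def native_worker():
    time.sleep(1)
    shared_lock.release()  # released from a native thread: the waiter's
                           # hub does not get woken up

def green_worker():
    shared_lock.acquire()  # mirrors the API request that never completes
    print("green worker got the lock")

real_threading.Thread(target=native_worker).start()
gt = eventlet.spawn(green_worker)
with eventlet.Timeout(5, False):  # bail out instead of hanging the demo
    gt.wait()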
To debug this further we have reintroduced guru meditation reports into
neutron-server[0] and configured uwsgi to send SIGWINCH on
harakiri[1], so that a guru meditation report is triggered whenever a
uwsgi worker deadlocks.
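For reference, wiring a guru meditation report to a signal with oslo.reports looks roughly like the sketch below; the exact integration in [0] may differ, and the signum value is an assumption matching the SIGWINCH-on-harakiri setup from [1] (without it, oslo.reports listens on its default signal).

import signal

from oslo_config import cfg
from oslo_reports import guru_meditation_report as gmr
from oslo_reports import opts as gmr_opts

from neutron import version  # neutron's pbr version_info

def setup_guru_meditation(conf=cfg.CONF):
    gmr_opts.set_defaults(conf)
    # signum=SIGWINCH is an assumption to match the uwsgi harakiri config.
    gmr.TextGuruMeditation.setup_autorun(
        version.version_info, conf=conf, signum=signal.SIGWINCH)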
The two most interesting candidates seem to be a shared lock inside
oslo_messaging and Python's logging lock, which also appears to be
acquired from oslo_messaging. Both cases identified by the traceback
point to oslo_messaging and its RPC server (see the attached guru
meditation report).
As all RPC servers should run inside neutron-rpc-server anyway (due to
the uwsgi/neutron-rpc-server split) we should move these instances
over there. This will also fix #1864418. One easy way to find such
instances is to check via the backdoor (or a manual manhole
installation, if the backdoor is not available) and search for instances
of oslo_messaging.server.MessageHandlingServer via fo(). In our setup
(due to the enabled service_plugins) we see RPC servers running for
trunk and logapi:
>>> [ep for mhs in fo(oslo_messaging.server.MessageHandlingServer) for ep in mhs.dispatcher.endpoints]
[<neutron.services.logapi.rpc.server.LoggingApiSkeleton object at 0x7fb0d465ec10>, <neutron.services.trunk.rpc.server.TrunkSkeleton object at 0x7f622ec11cd0>]
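A rough equivalent of the fo() lookup above for setups without the oslo.service backdoor is sketched below: install manhole manually, attach to the process, and walk the garbage collector for MessageHandlingServer instances. The attribute access mirrors the snippet above; the helper name find_rpc_servers() is purely illustrative.

import gc

import manhole
from oslo_messaging import server as rpc_server

manhole.install()  # exposes a UNIX-socket Python shell inside the process

def find_rpc_servers():
    servers = [obj for obj in gc.get_objects()
               if isinstance(obj, rpc_server.MessageHandlingServer)]
    return [ep for srv in servers for ep in srv.dispatcher.endpoints]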
These RPC servers should be started via start_rpc_listeners().
Nova has had similar problems with eventlet and logging in the past,
see here[2][3]. Tests were done with Neutron Yoga (or our own branch
stable/yoga-m3), but the issue is present in current master.
[0] https://github.com/sapcc/neutron/commit/a7c44263b70089d8106bed6d8d5d0e3ddf44d5ad
[1] https://github.com/sapcc/helm-charts/blob/7a93e91c3af16ad2eb91e0a1d176d56a26faa393/openstack/neutron/templates/etc/_uwsgi.ini.tpl#L46-L50
[2] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/host.py#L197-L201
[3] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/migration.py#L406-L407
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2062009/+subscriptions