[Bug 2062009] [NEW] Neutron-server + uwsgi deadlocks when running rpc workers
Public bug reported:
In certain situations we observe that neutron-server + uwsgi shares
locks between its native threads and its eventlet greenthreads. Since
eventlet relies on being notified when a lock is released, this can
lead to a deadlock: the eventlet greenthread waits indefinitely for a
lock that has already been released. In our infrastructure this results
in API requests that are carried out on the Neutron side while the
caller never receives a response. For actions like port creation from
e.g. Nova or Manila this leads to orphaned ports, because the caller
simply retries creating the port.
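To illustrate the pattern we believe is at play, here is a minimal,
hypothetical sketch (not taken from Neutron; it is expected to hang
when run, which is the point): a greenthread waits on a monkey-patched
lock while a plain OS thread releases it. As far as we understand
eventlet, the release schedules the wake-up on the releasing thread's
own hub, which never runs, so the waiter is never woken.

    import eventlet
    eventlet.monkey_patch()  # as neutron-server does; threading.Lock() is now a green lock

    import threading

    # un-patched modules, so we can create a real OS thread and really sleep
    real_threading = eventlet.patcher.original('threading')
    real_time = eventlet.patcher.original('time')

    shared_lock = threading.Lock()  # green lock, shared with a native thread below

    def native_worker():
        # stands in for e.g. a native oslo.messaging / driver thread
        shared_lock.acquire()
        real_time.sleep(1)
        shared_lock.release()  # the lock is free now, but the wake-up is scheduled
                               # on this thread's own hub, which never runs

    def green_worker():
        with shared_lock:      # waits here forever in the bad case
            print('green thread acquired the lock')

    real_threading.Thread(target=native_worker).start()
    eventlet.spawn(green_worker).wait()

(Whether the hang actually occurs is timing-dependent; the sketch is
only meant to show the pattern.)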
To debug this further we have reintroduced guru meditation reports into
neutron-server[0] and configured uwsgi to send a SIGWINCH on
harakiri[1], so that a guru meditation report is dumped whenever a
uwsgi worker deadlocks.
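For reference, [0] boils down to something like the following (a hedged
sketch, not the exact patch; passing signum and importing
neutron.version like this are assumptions on our side):

    import signal

    from oslo_reports import guru_meditation_report as gmr

    from neutron import version


    def setup_guru_meditation_report():
        # oslo.reports listens on SIGUSR2 by default; pointing it at SIGWINCH
        # lets the signal uwsgi sends on harakiri trigger the report dump.
        gmr.TextGuruMeditation.setup_autorun(version.version_info,
                                             signum=signal.SIGWINCH)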
The two most interesting candidates seem to be a shared lock inside
oslo_messaging and Python's logging lock, which is also acquired from
within oslo_messaging. Both cases identified by the traceback seem to
point to oslo_messaging and its RPC server (see the attached guru
meditation report).
As all RPC servers should run inside neutron-rpc-server anyway (due to
the uwsgi/neutron-rpc-server split), we should move these instances
over there. This would also fix bug #1864418. One easy way to find such
instances is to check via the backdoor (or a manually installed
manhole, if the backdoor is not available) and search for instances of
oslo_messaging.server.MessageHandlingServer via fo(); a rough manhole
equivalent is sketched below the example output. In our setup (due to
the service_plugins we have enabled) we see RPC servers running for
trunk and logapi:
>>> [ep for mhs in fo(oslo_messaging.server.MessageHandlingServer) for ep in mhs.dispatcher.endpoints]
[<neutron.services.logapi.rpc.server.LoggingApiSkeleton object at 0x7fb0d465ec10>, <neutron.services.trunk.rpc.server.TrunkSkeleton object at 0x7f622ec11cd0>]
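Where only manhole is available, roughly the following should give the
same result (a sketch on our side; fo() is, to our knowledge, simply
the find-objects helper of the oslo.service backdoor shell):

    import gc

    import oslo_messaging

    def find_objects(cls):
        # roughly what the backdoor's fo() does: scan the heap for live instances
        return [o for o in gc.get_objects() if isinstance(o, cls)]

    endpoints = [ep
                 for mhs in find_objects(oslo_messaging.server.MessageHandlingServer)
                 for ep in mhs.dispatcher.endpoints]
    print(endpoints)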
These RPC servers should instead be started via start_rpc_listeners();
a rough sketch of that follows.
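The following is only a hedged illustration of that idea, not the
actual trunk/logapi code; the plugin and endpoint names are made up,
and the neutron_lib.rpc wiring is our assumption based on how other
plugins set up their consumers:

    from neutron_lib import rpc as n_rpc


    class ExampleSkeleton(object):
        # hypothetical RPC endpoint, standing in for TrunkSkeleton/LoggingApiSkeleton

        def example_method(self, context):
            return 'ok'


    class ExamplePlugin(object):
        # hypothetical service plugin

        def start_rpc_listeners(self):
            # Neutron calls this only in the RPC workers / neutron-rpc-server,
            # never in the uwsgi API workers, so the MessageHandlingServer no
            # longer lives inside uwsgi.
            self.conn = n_rpc.Connection()
            self.conn.create_consumer('example-topic', [ExampleSkeleton()],
                                      fanout=False)
            return self.conn.consume_in_threads()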
Nova has had similar problems with eventlet and logging in the past,
see [2][3]. Tests were done with Neutron Yoga (or rather our own branch
stable/yoga-m3), but the issue is also present in current master.
[0] https://github.com/sapcc/neutron/commit/a7c44263b70089d8106bed6d8d5d0e3ddf44d5ad
[1] https://github.com/sapcc/helm-charts/blob/7a93e91c3af16ad2eb91e0a1d176d56a26faa393/openstack/neutron/templates/etc/_uwsgi.ini.tpl#L46-L50
[2] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/host.py#L197-L201
[3] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/migration.py#L406-L407
** Affects: neutron
Importance: Undecided
Status: New
** Attachment added: "guru-meditation-report.txt"
https://bugs.launchpad.net/bugs/2062009/+attachment/5766806/+files/guru-meditation-report.txt