yahoo-eng-team team mailing list archive

[Bug 2062009] Re: Neutron-server + uwsgi deadlocks when running rpc workers


Reviewed:  https://review.opendev.org/c/openstack/neutron/+/916112
Committed: https://opendev.org/openstack/neutron/commit/ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Submitter: "Zuul (22348)"
Branch:    master

commit ffcaeda32adf32388c322cfc6f7a8933ef94d2a9
Author: Sebastian Lohff <sebastian.lohff@xxxxxxx>
Date:   Mon Apr 15 16:14:50 2024 +0200

    Start trunk plugin RPC via service framework

    Instead of each individual driver setting up the RPC server (and setting
    the _rpc_backend attribute on the TrunkPlugin) we now check in the
    TrunkPlugin if any driver requires the RPC backend to be started.
    Additionally, we only start it when this is requested by Neutron via
    start_rpc_listeners(). This is required when running neutron-server and
    neutron-rpc-server separately to run RPC only in neutron-rpc-server.

    As we still need the notifiers of ServerSideRpcBackend to be
    created/started, we separate TrunkSkeleton (which is the RPC server
    implementation) and ServerSideRpcBackend (which is essentially only a
    notifier). In case RPC is required by a driver, we always start the
    notifier, but the RPC server only when requested via
    start_rpc_listeners().

    Change-Id: I2c6362b3320e534a6e65bd7701b5ac2feca42a49
    Closes-Bug: #2015275
    Closes-Bug: #2062009
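
The split described in the commit message can be pictured roughly as
follows. This is only a minimal sketch and not the actual Neutron code:
the classes Notifier, RpcServer, Driver and TrunkPluginSketch, as well as
the requires_rpc attribute, are hypothetical stand-ins. Only the method
name start_rpc_listeners() is taken from the commit message.

```python
class Notifier:
    """Hypothetical stand-in for the notifier-only ServerSideRpcBackend."""

    def __init__(self):
        self.started = True


class RpcServer:
    """Hypothetical stand-in for the RPC server serving TrunkSkeleton."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True


class Driver:
    """Hypothetical trunk driver that may or may not need RPC."""

    def __init__(self, name, requires_rpc):
        self.name = name
        self.requires_rpc = requires_rpc


class TrunkPluginSketch:
    """The plugin, not the individual drivers, owns the RPC setup."""

    def __init__(self, drivers):
        self.drivers = drivers
        # Check once whether any driver needs the RPC backend at all.
        self.rpc_required = any(d.requires_rpc for d in drivers)
        # The notifier is always created when RPC is required.
        self.notifier = Notifier() if self.rpc_required else None
        # The RPC server is NOT started here.
        self.rpc_server = None

    def start_rpc_listeners(self):
        """Start the RPC server only when the service framework asks."""
        if not self.rpc_required:
            return []
        self.rpc_server = RpcServer()
        self.rpc_server.start()
        return [self.rpc_server]


# API process (uwsgi): the plugin is loaded, listeners are never requested.
api_process = TrunkPluginSketch([Driver("ovs", requires_rpc=True)])

# RPC process (neutron-rpc-server): listeners are explicitly requested.
rpc_process = TrunkPluginSketch([Driver("ovs", requires_rpc=True)])
rpc_process.start_rpc_listeners()
```

In this sketch the API process ends up with a notifier but no RPC
server, while the RPC process has both.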

** Changed in: neutron
       Status: In Progress => Fix Released

You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.

  Neutron-server + uwsgi deadlocks when running rpc workers

Status in neutron:
  Fix Released

Bug description:
  In certain situations we observe that neutron-server + uwsgi shares
  locks between its native threads and its eventlet threads. As eventlet
  relies on being informed when a lock is released, this may lead to a
  deadlock, as the eventlet thread waits indefinitely for an already
  released lock. In our infrastructure this leads to API requests being
  performed on Neutron side, but the caller never gets a response. On
  actions like port creations from e.g. Nova or Manila this will lead to
  orphaned ports, as the implementation will just try again with
  creating the port.

  To better debug this we have reintroduced guru meditation reports into
  neutron-server[0] and configured uwsgi to send a SIGWINCH on a
  harakiri[1] to trigger the guru meditation whenever a uwsgi worker

  The two most interesting candidates seem to be a shared lock inside
  oslo_messaging and python's logging lock, which seems to also be
  called from oslo_messaging. Both cases identified by the traceback
  seem to point to oslo_messaging and its RPC Server (see attached guru
  meditation report).

  As all RPC Servers should run inside neutron-rpc-server anyway (due to
  the uwsgi/neutron-rpc-server split) we should move these instances
  over there. This will also fix #1864418. One easy way to find
  instances of this would be to check via backdoor (or a manual manhole
  installation, if backdoor is not available) and search instances of
  oslo_messaging.server.MessageHandlingServer via fo(). In our setup
  (due to the service_plugins enabled) we see rpc servers running from
  trunk and logapi:

  >>> [ep for mhs in fo(oslo_messaging.server.MessageHandlingServer) for ep in mhs.dispatcher.endpoints]
  [<neutron.services.logapi.rpc.server.LoggingApiSkeleton object at 0x7fb0d465ec10>, <neutron.services.trunk.rpc.server.TrunkSkeleton object at 0x7f622ec11cd0>]
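
  Where no backdoor is available, a similar lookup can be approximated
  with the standard gc module. The sketch below assumes that fo() is
  essentially an isinstance() filter over gc.get_objects(). The helper
  name find_objects and all *Stub classes are hypothetical and only mimic
  the shape of the objects seen in the session above.

```python
import gc


def find_objects(cls):
    """Return all live, GC-tracked objects that are instances of cls."""
    return [obj for obj in gc.get_objects() if isinstance(obj, cls)]


class DispatcherStub:
    """Hypothetical stand-in holding the RPC endpoints of a server."""

    def __init__(self, endpoints):
        self.endpoints = endpoints


class MessageHandlingServerStub:
    """Hypothetical stand-in for oslo_messaging's MessageHandlingServer."""

    def __init__(self, endpoints):
        self.dispatcher = DispatcherStub(endpoints)


class TrunkSkeletonStub:
    pass


class LoggingApiSkeletonStub:
    pass


# Keep references so the objects stay alive while we search for them.
servers = [
    MessageHandlingServerStub([TrunkSkeletonStub()]),
    MessageHandlingServerStub([LoggingApiSkeletonStub()]),
]

# Same shape as the backdoor one-liner: collect every endpoint of every
# server instance that is currently alive in this process.
endpoint_names = sorted(
    type(ep).__name__
    for server in find_objects(MessageHandlingServerStub)
    for ep in server.dispatcher.endpoints
)
print(endpoint_names)
```

  Any endpoint listed this way inside an API worker points to an RPC
  server that is running in the wrong process.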

  The RPC servers should be started via start_rpc_listeners().

  Nova has had similar problems with eventlet and logging in the past,
  see here[2][3]. Tests were done with Neutron Yoga (or our own brand,
  stable/yoga-m3), but the issue is present in current master.

  [0] https://github.com/sapcc/neutron/commit/a7c44263b70089d8106bed6d8d5d0e3ddf44d5ad
  [1] https://github.com/sapcc/helm-charts/blob/7a93e91c3af16ad2eb91e0a1d176d56a26faa393/openstack/neutron/templates/etc/_uwsgi.ini.tpl#L46-L50
  [2] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/host.py#L197-L201
  [3] https://github.com/sapcc/nova/blob/f61bd589796f0cd7ae37683de3d676e5edd9cf80/nova/virt/libvirt/migration.py#L406-L407
