[Bug 1594439] Re: Bad initialization sequence in Neutron agents (and maybe somewhere else)


Bug closed due to lack of activity; please feel free to reopen if
needed.

** Changed in: neutron
       Status: Confirmed => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1594439

Title:
  Bad initialization sequence in Neutron agents (and maybe somewhere
  else)

Status in neutron:
  Won't Fix

Bug description:
  TL;DR: running background threads that modify global state (which can
  happen through any RPC call, or any other library call) while forking
  is going on may cause weird, hard-to-debug errors, so background jobs
  should be started only after forking (or, even better, in a separate
  process).

  Currently at least the metadata and l3 agents start background threads
  that periodically report state before forking and running the main
  loop in child processes. Those threads can modify global state so that
  it is inconsistent at the time of forking, which leads to bad service
  behavior. In an ideal world the main process shouldn't do anything
  except manage child processes, so that no global state leaks into the
  children (open fds, held locks, etc).

  This bug was uncovered while investigating a weird, seemingly
  unrelated failure in the grenade job for Ann's CR that switches to the
  new oslo.db EngineFacade [0]. The symptom was an SSHTimeout while the
  instance's cloud-init was stuck trying, and failing, to get metadata.
  In the q-meta logs we noticed that there were no INFO messages about
  incoming HTTP requests, but there were some DEBUG ones, which means
  that requests were reaching the agent but not being responded to.
  Digging deeper, we noticed that in normal operation the metadata agent
  should (a rough sketch follows the list):

  - receive request;
  - do RPC call to neutron-server;
  - do HTTP request to Nova;
  - send response.
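
  Roughly, that request path could be sketched like this (hedged,
  hypothetical names: the RPC method, the header and the client objects
  are illustrative only, not the agent's actual code):

    import requests  # assumed HTTP client for the Nova leg

    def handle_metadata_request(req, rpc_client, nova_metadata_url):
        # RPC CALL to neutron-server: which port/instance owns this IP?
        ports = rpc_client.call({}, 'get_ports',
                                filters={'fixed_ips': req.remote_addr})
        instance_id = ports[0]['device_id']
        # HTTP request to the Nova metadata API on behalf of the instance
        resp = requests.get(nova_metadata_url + req.path,
                            headers={'X-Instance-ID': instance_id})
        # send the response back to the instance
        return resp.status_code, resp.content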

  There were no RPC CALLs for those requests in the logs, only the CASTs
  from state reporting and the very first CALL that state reporting
  makes.

  Since gate jobs, especially multinode ones, are hard to reproduce
  locally, we created another CR [1] that adds tracing of every executed
  Python line to see what really happens. You can find the long log with
  all the tracing at [2] or in the attachment (logs don't live forever).
  It uncovered the following chain of events:

  - the main thread in the main process starts a background thread for state reporting;
  - some time later that thread starts reporting and wants to do its first CALL (it does one CALL and then only CASTs);
  - to get the 'reply_q' (essentially, a connection shared for replies, IIUC), it acquires a lock;
  - since no connections are available at that point (it's the first time RPC is used), oslo_messaging starts connecting to RabbitMQ;
  - the background thread yields execution to the main thread;
  - the main thread forks a bunch of WSGI workers;
  - in the WSGI workers, when requests come in, the handler tries to do a CALL to neutron-server;
  - to get the 'reply_q' it tries to acquire the same lock, but it has already been taken by the background thread in the main process;
  - it hangs forever, which can be seen in the Guru Meditation Report (a minimal reproduction is sketched below).
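
  The underlying hazard can be reproduced outside Neutron in a few lines
  of plain Python (a minimal sketch that uses threading and os.fork to
  stand in for eventlet and oslo_messaging internals):

    import os
    import threading
    import time

    lock = threading.Lock()        # stands in for the reply_q lock

    def report_state():
        with lock:                 # the background thread grabs the lock...
            time.sleep(5)          # ...and still holds it when the fork happens

    threading.Thread(target=report_state).start()
    time.sleep(0.1)                # make sure the thread is holding the lock

    if os.fork() == 0:             # the "WSGI worker"
        lock.acquire()             # the copied lock is already locked and its
                                   # owner thread does not exist in the child,
                                   # so this blocks forever
        print('never reached')
        os._exit(0)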

  There are several problems here (including eventlet locks not being
  fork-aware), but the one fix that addresses them all is to start such
  background threads only after all forking has happened. I've published
  CR [3] to verify that changing the initialization order fixes this
  issue, and it did.
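
  Schematically, the reordering amounts to something like this (not the
  actual patch in [3], just an illustration of the ordering with stub
  functions):

    import os
    import threading
    import time

    def report_state():            # stub for the periodic state reports
        while True:
            time.sleep(30)

    def serve_requests():          # stub for a WSGI worker's main loop
        time.sleep(60)

    def fork_workers(n):
        for _ in range(n):
            if os.fork() == 0:     # child worker inherits no reporting
                serve_requests()   # thread, no RPC connection, no held locks
                os._exit(0)

    fork_workers(4)                # fork all the workers first...
    threading.Thread(target=report_state, daemon=True).start()  # ...then report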

  Note that, from what I've been told, forks can still happen later if a
  child worker unexpectedly dies and the main process re-forks it. To
  properly fix this issue we should not do anything in the main process
  that can spoil global state inherited by the child processes. That
  means we'll need to either run state reporting in a separate process
  or give it an isolated oslo_messaging environment (a context? I am not
  an expert in oslo_messaging).
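
  The separate-process variant could look roughly like this (a sketch
  assuming plain multiprocessing; in a real agent it would more likely
  be a dedicated oslo.service worker, and report_state_forever is just a
  stand-in for the actual reporting loop):

    import multiprocessing
    import time

    def report_state_forever():
        # This process owns its own RPC transport; nothing here is shared
        # with the parent or the WSGI workers, so workers can be re-forked
        # later without inheriting half-initialized state.
        while True:
            time.sleep(30)         # stand-in for the periodic report CAST

    if __name__ == '__main__':
        reporter = multiprocessing.Process(target=report_state_forever,
                                           daemon=True)
        reporter.start()           # state reporting lives in its own process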

  [0] https://review.openstack.org/312393
  [1] https://review.openstack.org/331485
  [2] http://logs.openstack.org/85/331485/5/check/gate-grenade-dsvm-neutron-multinode/1c65056/logs/new/screen-q-meta.txt.gz
  [3] https://review.openstack.org/331672

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1594439/+subscriptions


