yahoo-eng-team team mailing list archive

[Bug 1594439] [NEW] Bad initialization sequence in Neutron agents (and maybe somewhere else)

Public bug reported:

TL;DR: running background threads that modify global state (which can happen through any RPC call or any other library call) while forking takes place may cause weird, hard-to-debug errors, so background jobs should be initialized after forking (or, even better, in a separate process).

Currently at least the metadata and L3 agents start background threads that periodically report state before forking and running the main loop in the child processes. Those threads can modify global state so that it is inconsistent at the moment of forking, which leads to bad service behavior. Ideally the main process shouldn't do anything except manage its child processes, so that no global state (open fds, held locks, etc.) leaks into them.
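
To make the failure mode concrete, here is a minimal standalone sketch of the general hazard using plain Python threading (an illustration only, not Neutron or eventlet code):

    import os
    import threading
    import time

    lock = threading.Lock()            # stands in for any shared global state

    def background_job():              # e.g. periodic state reporting
        with lock:
            time.sleep(10)             # lock is still held when the fork happens

    threading.Thread(target=background_job, daemon=True).start()
    time.sleep(0.1)                    # let the background thread grab the lock

    pid = os.fork()                    # "fork the workers" while the lock is held
    if pid == 0:
        lock.acquire()                 # child: only one thread survives the fork,
        print("never reached")         # so nothing will ever release the lock

The child inherits the lock in the locked state but not the thread that would release it, so it hangs forever; the 'reply_q' problem described below has exactly this shape.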

This bug was uncovered while investigating a weird, seemingly unrelated error in the grenade job for Ann's CR that switches to the new oslo.db EngineFacade [0]. The symptom was an SSHTimeout while the instance's cloud-init was busy trying, and failing, to get metadata. In the q-meta logs we noticed that there were no INFO messages about incoming HTTP requests, but there were some DEBUG ones, which means that requests were reaching the agent but were not being responded to. Digging deeper, we noted that in normal operation the metadata agent should:

- receive the request;
- do an RPC call to neutron-server;
- do an HTTP request to Nova;
- send the response.

There were no RPC CALLs in the logs, only CASTs from state reporting, plus the very first CALL that state reporting makes.
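
For context on CALL vs CAST: a CALL blocks waiting for a reply and therefore needs the shared reply queue ('reply_q' below) to be set up, while a CAST is fire-and-forget. Here is a rough sketch of how an agent-side RPC client uses the two; the topic, method and argument names are illustrative, not the exact Neutron ones:

    from oslo_config import cfg
    import oslo_messaging as messaging

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='q-reports-plugin', version='1.0')
    client = messaging.RPCClient(transport, target)
    ctxt = {}

    # CALL: blocks until the server replies, so it needs reply_q and the lock
    # guarding its lazy setup.
    client.call(ctxt, 'report_state', agent_state={'alive': True})

    # CAST: fire-and-forget, no reply queue involved.
    client.cast(ctxt, 'report_state', agent_state={'alive': True})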

Since gate jobs are hard to reproduce, especially multinode ones, we created another CR [1] that adds tracing of every executed Python line of code to see what really happens. You can find the long log with all the tracing at [2] or in the attachment (logs don't live forever).
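
For reference, per-line tracing of this kind boils down to the interpreter's trace hook; the snippet below is an illustrative sketch of the idea, not the actual code from CR [1]:

    import logging
    import sys
    import threading

    logging.basicConfig(level=logging.DEBUG)
    LOG = logging.getLogger('line-trace')

    def _trace(frame, event, arg):
        # Invoked by the interpreter; log every executed Python line.
        if event == 'line':
            LOG.debug('%s:%d', frame.f_code.co_filename, frame.f_lineno)
        return _trace

    sys.settrace(_trace)          # trace the current thread
    threading.settrace(_trace)    # and any threads started after this point

The tracing uncovered the following chain of events: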

- the main thread in the main process starts a background thread for state reporting;
- some time later that thread starts reporting and wants to do its first CALL (it does one CALL and then only CASTs);
- to get 'reply_q' (essentially, the connection shared for replies, IIUC), it acquires a lock;
- since no connections are available at that point (it's the first time RPC is used), oslo_messaging starts connecting to RabbitMQ;
- the background thread yields execution to the main thread;
- the main thread forks a bunch of WSGI workers;
- when requests arrive in the WSGI workers, the handler tries to do a CALL to neutron-server;
- to get 'reply_q' it tries to acquire the lock, but the lock has already been "taken" by the background thread in the main process;
- it hangs forever, which can be seen in the Guru Meditation Report.

There are several problems here (including eventlet's locks not being fork-aware), but the one change that fixes them all is to start such background threads only after all forking has happened. I've published CR [3] to verify that changing the initialization order fixes this issue, and it did.
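
Schematically, the change is just an ordering swap relative to the earlier sketch (again an illustration, not the actual patch in [3]): fork everything first, start global-state-touching threads afterwards.

    import os
    import threading
    import time

    lock = threading.Lock()

    def background_job():              # state-reporting stand-in
        with lock:
            time.sleep(10)

    pid = os.fork()                    # fork all workers first, while no
    if pid == 0:                       # background thread exists yet
        with lock:                     # worker: the lock is free, no deadlock
            print("worker proceeds normally")
        os._exit(0)

    threading.Thread(target=background_job, daemon=True).start()  # only now
    os.waitpid(pid, 0)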

Note that, from what I've been told, forks can still happen later if a child worker unexpectedly dies and the main process reforks it. To properly fix this issue we should not do anything in the main process that can spoil global state in a way that affects the child processes. That means we'll need to either run state reporting in a separate process or give it an isolated oslo_messaging environment (a context? I am not an expert in oslo_messaging).
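
As an illustration of the separate-process option (a hypothetical sketch, not a concrete design for Neutron): state reporting gets its own process, started before any RPC state exists in the parent, so nothing it does can leak into the forked workers.

    import multiprocessing
    import time

    def state_reporting_loop():
        # This process would set up its own RPC transport and connections;
        # none of its locks, sockets or threads are shared with the workers.
        while True:
            # report_state() would happen here
            time.sleep(30)

    reporter = multiprocessing.Process(target=state_reporting_loop, daemon=True)
    reporter.start()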

[0] https://review.openstack.org/312393
[1] https://review.openstack.org/331485
[2] http://logs.openstack.org/85/331485/5/check/gate-grenade-dsvm-neutron-multinode/1c65056/logs/new/screen-q-meta.txt.gz
[3] https://review.openstack.org/331672

** Affects: neutron
     Importance: Undecided
     Assignee: Yuriy Taraday (yorik-sar)
         Status: In Progress

** Attachment added: "Logs with tracing"
   https://bugs.launchpad.net/bugs/1594439/+attachment/4687361/+files/screen-q-meta.txt.gz
