yahoo-eng-team team mailing list archive

[Bug 1525901] [NEW] Agents report as started before neutron recognizes as active

 

Public bug reported:

In HA, there is a potential race condition between the openvswitch agent
and other agents that "own", depend on, or manipulate ports. When the
neutron server resumes after a failover, it is not immediately aware of
openvswitch agents that were also activated by the failover, and it acts
as though there are no active openvswitch agents. (openvswitch is only an
example; this most likely affects other L2 agents as well.) If an agent
such as the L3 agent starts and begins its resync before the neutron
server is aware of the active openvswitch agent, ports for the routers on
that agent will be marked as "binding_failed". Currently this is a
"terminal" state for the port, as neutron does not attempt to rebind
failed bindings on the same host.
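
The ordering problem can be illustrated with a toy model. This is only a
sketch and not neutron code: the ToyServer class, its method names, and
the "bound" state label are hypothetical. It assumes only what is stated
above, namely that a bind attempted before the server knows about an
active L2 agent on the host ends in "binding_failed", and that a failed
binding is not retried on the same host.

```python
class ToyServer:
    """Hypothetical stand-in for the server's view of agents and bindings."""

    def __init__(self):
        self.active_l2_hosts = set()  # hosts whose L2 agent has reported in
        self.bindings = {}            # port id -> (host, state)

    def report_state(self, host):
        # The L2 agent on `host` tells the server that it is alive.
        self.active_l2_hosts.add(host)

    def bind_port(self, port_id, host):
        previous = self.bindings.get(port_id)
        # A failed binding is terminal for the same host: no rebind attempt.
        if previous == (host, "binding_failed"):
            return "binding_failed"
        state = "bound" if host in self.active_l2_hosts else "binding_failed"
        self.bindings[port_id] = (host, state)
        return state


# Race lost: the L3 agent resyncs before the L2 agent has reported in.
server = ToyServer()
first = server.bind_port("router-port", "node-1")
server.report_state("node-1")
second = server.bind_port("router-port", "node-1")

# Race won: the L2 agent is known to the server before the bind.
ordered = ToyServer()
ordered.report_state("node-1")
third = ordered.bind_port("router-port", "node-1")
```

In this model the port stays failed even after the L2 agent reports in,
which is why start order matters.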

Unfortunately, the neutron agents do not provide even a best-effort
deterministic indication to the outside service manager (systemd,
pacemaker, etc.) that they have fully initialized and that the neutron
server should be aware that they are active. Agents should follow the
same pattern as WSGI-based services and notify systemd once it can
reasonably be assumed that the neutron server is aware that they are
alive. That way, service startup ordering logic or constraints can start
an agent that depends on other agents *after* neutron should be aware
that the required agents are active.
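
A minimal sketch of the proposed pattern, not the actual neutron
implementation: it speaks the systemd notification protocol directly (a
datagram containing "READY=1" sent to the socket named by the
NOTIFY_SOCKET environment variable). The helper
notify_after_first_report and its report_state callable are hypothetical;
report_state is assumed to return True once the server has accepted the
agent's state report.

```python
import os
import socket
import time


def sd_notify(message="READY=1"):
    """Send a notification to the service manager, if one is listening.

    Returns False when NOTIFY_SOCKET is unset, for example when the
    process was not started by systemd as a Type=notify service.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def notify_after_first_report(report_state, retries=5, delay=1.0):
    """Hypothetical helper: signal readiness only after a successful report.

    Returns True only if a report succeeded and a notification was sent.
    """
    for attempt in range(retries):
        if report_state():
            return sd_notify("READY=1")
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

With a unit of Type=notify, systemd would then treat the agent as started
only after this call, so ordering constraints on dependent agents take
effect after the first successful report rather than at process launch.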

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1525901

Title:
  Agents report as started before neutron recognizes as active

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1525901/+subscriptions

