[Bug 1402010] Re: L3 HA state transitions raceful; Metadata proxy not always opened and closed correctly

 

** Changed in: neutron
       Status: Fix Committed => Fix Released

** Changed in: neutron
    Milestone: None => kilo-3

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1402010

Title:
  L3 HA state transitions raceful; Metadata proxy not always opened and
  closed correctly

Status in OpenStack Neutron (virtual network service):
  Fix Released

Bug description:
  L3 HA configures keepalived to invoke a bash script asynchronously
  whenever it performs a state transition. This script writes the new
  state to disk and spawns or destroys the metadata proxy depending on
the new state. Manual testing has revealed that when keepalived
  changes state twice in quick succession, for example to standby and
  then to master, the standby and master scripts are invoked in the
  correct order, but their bodies may then execute out of order.
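
  As a sketch of the moving parts, a notify script of this shape is
  enough to reason about the race; the paths, router id and the
  abbreviated proxy invocation below are illustrative, not the actual
  Neutron sources:

      #!/bin/bash
      # Illustrative notify script. keepalived installs one script per
      # state; the state is taken as $1 here for brevity.
      STATE=$1
      STATE_FILE=/var/lib/neutron/ha_confs/router1/state   # hypothetical
      PID_FILE=/var/lib/neutron/external/pids/router1.pid  # hypothetical

      echo "$STATE" > "$STATE_FILE"

      if [ "$STATE" = "master" ]; then
          # The new master spawns the metadata proxy (arguments trimmed).
          neutron-ns-metadata-proxy --router_id=router1 \
              --pid_file="$PID_FILE" &
      elif [ -f "$PID_FILE" ]; then
          # A router leaving the master state tears the proxy down.
          kill "$(cat "$PID_FILE")"
      fi

  Because keepalived fires these scripts asynchronously, nothing stops
  a second invocation from overlapping the first.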

  For example, keepalived changes state to standby and then immediately
  to master. notify_standby is executed, followed by notify_master.
  However, notify_master writes 'master' to disk first, and notify_standby
  then overwrites it with 'standby', even though the router is now the
  master. Spawning and destroying the metadata proxy may be reordered in
  the same way.
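
  One conventional way to keep a handler's body from interleaving with
  the next invocation is a file lock around the whole script. This is a
  sketch of the idea, not necessarily the fix that was merged:

      #!/bin/bash
      # Sketch only: serialize notify invocations with flock(1).
      LOCK=/var/lib/neutron/ha_confs/router1/notify.lock   # hypothetical

      exec 9>"$LOCK"
      flock 9   # block until any in-flight invocation has finished

      # ...write the state file and manage the metadata proxy here,
      # knowing no other invocation is interleaved...

  A lock only prevents interleaving: if two invocations are already
  waiting, they are not guaranteed to acquire the lock in the order
  keepalived started them, which is part of why moving this logic into
  the agent (see below) is the more robust design.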

  Currently the state is written to disk only for debugging purposes, so
  the effect of the bug is that a new master may not have the metadata
  proxy running, and that routers going to standby may not shut it
  down.

  Note that the bash notifier scripts will be replaced by Python scripts
  during the Kilo cycle. The Python script will write the new state to
  disk, then inform the agent of the state transition via a Unix domain
  socket; the agent will then manage the metadata proxy and perform
  additional actions. Since bp/report-ha-router-master has not yet been
  merged, the fix will be applied to that yet-unmerged Python script
  before it merges. The bug will also be fixed in the current bash
  scripts, and the fix backported to Juno.
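
  The described flow, sketched here in shell for brevity (the real
  notifier will be Python; the socket path and message format below
  are assumptions, and socat is assumed to be available):

      #!/bin/bash
      # Persist the new state, then tell the agent about the
      # transition over a Unix domain socket.
      STATE=$1
      echo "$STATE" > /var/lib/neutron/ha_confs/router1/state
      printf 'router1,%s' "$STATE" | \
          socat - UNIX-CONNECT:/var/lib/neutron/ha_confs/agent.sock

  With this split, the racy spawn/destroy decisions move into a single
  long-lived process, the agent, which can order the transitions it
  receives.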

  How to reproduce:
  Three L3 agents, max_l3_agents_per_router = 3. Cause all three router
  instances to fail by shutting down the HA interface in each router
  namespace, then bring two back up simultaneously. The issue reproduced
  after 5 or 6 attempts; there may be an easier reproduction. I have not
  been able to automate a failing functional test at this time.
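
  From the shell, the failure can be induced along these lines
  (namespace and interface names are per-deployment placeholders):

      # On each of the three agents, inside the router's namespace:
      ip netns exec qrouter-<router-id> ip link set dev ha-<port-id> down
      # Then, on two of the agents at (nearly) the same time:
      ip netns exec qrouter-<router-id> ip link set dev ha-<port-id> up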

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1402010/+subscriptions

