
yahoo-eng-team team mailing list archive

[Bug 1402010] [NEW] L3 HA state transitions are racy; metadata proxy not always started and stopped correctly

 

Public bug reported:

L3 HA configures keepalived to invoke a bash script asynchronously
whenever it performs a state transition. This script writes the new
state to disk and spawns or destroys the metadata proxy depending on the
new state. Manual testing has revealed that when keepalived changes
state twice in quick succession, for example to standby and then to
master, the standby and master scripts are invoked in the correct
order, but because they run asynchronously their bodies may execute
out of order.

For example, keepalived changes state to standby and then immediately to master.
notify_standby runs first, followed by notify_master. However, notify_master writes 'master' to disk before notify_standby writes 'standby', leaving stale state on disk. Spawning and destroying the metadata proxy may be interleaved out of order in the same way.
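
As a minimal sketch in Python of that interleaving (the real notifiers are the Bash scripts the agent generates; the state-file path and the helper below are made up): each handler writes the state file and then acts on the proxy, and nothing serializes two handlers fired back to back.

import threading
import time

# Hypothetical path; the real state file lives under the agent's router config dir.
STATE_FILE = '/tmp/ha_state'


def manage_metadata_proxy(state):
    # Stand-in for spawning or killing the metadata proxy.
    print('metadata proxy action for state: %s' % state)


def notify(state, delay=0.0):
    # keepalived invokes one of these handlers asynchronously per transition.
    time.sleep(delay)  # simulate scheduling skew between the two handlers
    with open(STATE_FILE, 'w') as f:
        f.write(state)
    manage_metadata_proxy(state)


# keepalived flips to standby, then immediately to master:
standby = threading.Thread(target=notify, args=('standby', 0.1))
master = threading.Thread(target=notify, args=('master',))
standby.start()   # invoked first ...
master.start()    # ... but the master handler finishes first here
standby.join()
master.join()

# The file now reads 'standby' even though the instance is master, and the
# proxy was destroyed after it was spawned.
print(open(STATE_FILE).read())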

Currently the state is written to disk only for debugging purposes, so
the effect of the bug is that a new master may not have the metadata
proxy running, and that routers going to standby may not shut the
proxy down.

Note that the bash notifier scripts will be replaced by Python scripts
during the Kilo cycle. The Python script will write the new state to
disk, then inform the agent of the state transition via a Unix domain
socket, and the agent will then manage the metadata proxy and perform
additional actions. Since bp/report-ha-router-master has not yet
merged, the fix will go into that Python script before it merges, as
would be expected. The bug will also be fixed in the current Bash
scripts and the fix backported to Juno.
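
A rough sketch of the shape such a notifier could take, purely as an assumption: the lock file, state file, socket path, wire format and the flock-based serialization below are placeholders for illustration, not the bp/report-ha-router-master design.

import fcntl
import socket
import sys

# All paths below are hypothetical placeholders.
LOCK_FILE = '/var/lib/neutron/ha_state.lock'
STATE_FILE = '/var/lib/neutron/ha_state'
AGENT_SOCKET = '/var/lib/neutron/l3-agent-notify.sock'


def notify_agent(router_id, state):
    # Tell the L3 agent about the transition over a Unix domain socket,
    # so it can manage the metadata proxy and report the new state.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(AGENT_SOCKET)
        sock.sendall(('%s,%s' % (router_id, state)).encode())
    finally:
        sock.close()


def main(router_id, state):
    # Hold an exclusive lock for the whole transition so that two
    # back-to-back invocations cannot interleave their writes and
    # proxy actions.
    with open(LOCK_FILE, 'w') as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        with open(STATE_FILE, 'w') as f:
            f.write(state)
        notify_agent(router_id, state)


if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])  # e.g. notify.py <router-id> master

Note that an exclusive lock only prevents two transitions from interleaving; it does not by itself guarantee they are applied in invocation order, so the agent side would still need to treat the last transition it receives (or keepalived's actual current state) as authoritative.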

How to reproduce:
Run three L3 agents with max_l3_agents_per_router = 3. Cause all three router instances to fail by shutting down the HA interface in each router namespace, then bring two of them back up simultaneously. I reproduced the issue after 5 or 6 attempts; there may well be an easier reproduction. I have not yet been able to automate a failing functional test.
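
For what it's worth, the reproduction as a Python sketch driving ip from the host; it assumes all three agents run on one host and uses made-up qrouter-* namespace names and ha-* device names, so substitute the real ones from `ip netns`.

import subprocess


def set_ha_link(namespace, ha_dev, state):
    # Flip the HA (VRRP) interface inside a router namespace up or down.
    subprocess.check_call(
        ['ip', 'netns', 'exec', namespace, 'ip', 'link', 'set', ha_dev, state])


# Placeholder names; list the real ones with `ip netns` and
# `ip netns exec <ns> ip addr`.
instances = [
    ('qrouter-aaaa', 'ha-1111'),
    ('qrouter-bbbb', 'ha-2222'),
    ('qrouter-cccc', 'ha-3333'),
]

# Fail all three router instances ...
for ns, dev in instances:
    set_ha_link(ns, dev, 'down')

# ... then bring two back at roughly the same time to race the transitions.
for ns, dev in instances[:2]:
    set_ha_link(ns, dev, 'up')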

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1402010
