yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #25514
[Bug 1402010] [NEW] L3 HA state transitions raceful; Metadata proxy not always opened and closed correctly
Public bug reported:
L3 HA configures keepalived to invoke a bash script asynchronously
whenever it performs a state transition. This script writes the new
state to disk and spawns or destroys the metadata proxy depending on the
new state. Manual testing has revealed that when keepalived changes
state twice consecutively, for example, to standby and then to master,
the standby and master scripts are called in the correct order, but may
then execute code in their scripts out of order.
For example, keepalived changes state to standby and then immediately to master.
notify_standby is executed, followed by notify_master. However, notify_master writes 'master' to disk first, then notify_standby writes 'standby'. Spawning and destroying the metadata proxy may also be performed out of order.
Currently, the state is written to disk for debugging purposes and so
the effect of the bug is that a new master may not have the metadata
proxy up, and that routers going to standby may not shut down the proxy.
Note that the bash notifier scripts will be replaced by Python scripts
during the Kilo cycle. The Python script will write the new state to
disk, then inform the agent of a state transition via a Unix domain
socket. The agent will then manage the metadata proxy and perform
additional actions. Since bp/report-ha-router-master has not yet been
merged, this bug will be fixed in the yet-unmerged Python script pre-
merge as would be expected. Additionally the bug will be fixed in the
current Bash scripts, and back ported to Juno.
How to reproduce:
Three L3 agents, max_l3_agents_per_router = 3. Cause all three router instances to fail by shutting down the HA interface in the router namespaces, then bring two back simultaneously. I successfully reproduce the issue after 5 or 6 times. Perhaps there's an easier reproduction. I haven't been able to automate a failing functional test at this time.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: l3-ha
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1402010
Title:
L3 HA state transitions raceful; Metadata proxy not always opened and
closed correctly
Status in OpenStack Neutron (virtual network service):
New
Bug description:
L3 HA configures keepalived to invoke a bash script asynchronously
whenever it performs a state transition. This script writes the new
state to disk and spawns or destroys the metadata proxy depending on
the new state. Manual testing has revealed that when keepalived
changes state twice consecutively, for example, to standby and then to
master, the standby and master scripts are called in the correct
order, but may then execute code in their scripts out of order.
For example, keepalived changes state to standby and then immediately to master.
notify_standby is executed, followed by notify_master. However, notify_master writes 'master' to disk first, then notify_standby writes 'standby'. Spawning and destroying the metadata proxy may also be performed out of order.
Currently, the state is written to disk for debugging purposes and so
the effect of the bug is that a new master may not have the metadata
proxy up, and that routers going to standby may not shut down the
proxy.
Note that the bash notifier scripts will be replaced by Python scripts
during the Kilo cycle. The Python script will write the new state to
disk, then inform the agent of a state transition via a Unix domain
socket. The agent will then manage the metadata proxy and perform
additional actions. Since bp/report-ha-router-master has not yet been
merged, this bug will be fixed in the yet-unmerged Python script pre-
merge as would be expected. Additionally the bug will be fixed in the
current Bash scripts, and back ported to Juno.
How to reproduce:
Three L3 agents, max_l3_agents_per_router = 3. Cause all three router instances to fail by shutting down the HA interface in the router namespaces, then bring two back simultaneously. I successfully reproduce the issue after 5 or 6 times. Perhaps there's an easier reproduction. I haven't been able to automate a failing functional test at this time.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1402010/+subscriptions
Follow ups
References