yahoo-eng-team team mailing list archive

[Bug 1365453] Re: State of HA routers not exposed via API

 

Removed Related and Closes bug tags from all patches in the series.

** Changed in: neutron
       Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1365453

Title:
  State of HA routers not exposed via API

Status in OpenStack Neutron (virtual network service):
  Invalid

Bug description:
  It's very difficult to know the state of an HA router on each L3
  agent: which instance is the master, and where it is hosted. This is
  a maintenance nightmare. What if I have a split brain? What if I want
  to know where the master is so I can move it manually? Currently, the
  only way to find out is to SSH into an agent and run:

  cat $state_path/ha_confs/router_id/state

  But this method requires accessing each individual agent. A more
  user-friendly approach would be to expose this via the API, so admins
  could query a router and get a list of agents and the state on each
  agent:

  l3-agent-list-hosting-router <router_id>
  This currently shows all of the agents hosting the requested router.
  It will now also show the HA state on each agent (which agent is the
  master and which are the standbys).

  Implementation choices:
  Keepalived doesn't provide a way to query the current VRRP state. The
  only way to know it, then, is to use notifier scripts. These scripts
  are executed whenever a state transition occurs and receive the new
  state (master, backup, or fault).
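  For illustration, the notifier hookup in a per-router keepalived
  config could look like the following (the interface name and script
  path are hypothetical, not the actual neutron paths); keepalived
  passes the entity type, the instance name, and the new state as
  arguments to the script:

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-7dfa4a3b
    virtual_router_id 1
    priority 50
    # Executed on every MASTER/BACKUP/FAULT transition; keepalived
    # invokes it as: <script> "INSTANCE" <name> <new_state>
    notify "/usr/bin/neutron-ha-state-change"
}
```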

  Every time we reconfigure keepalived (when the router is created or
  updated) we tell it to execute a Python script (maintained as part of
  the repository). The script will:
  1) Write the new state to a file in $state_path/ha_confs/router_id/state
  2) Start the metadata proxy if the transition was to master, or shut it down if the transition was to backup or fault.
  3) Notify the agent that a transition has occurred via a unix domain socket. Steps 1 and 2 happen in the script, rather than in the agent after it receives the notification, because we want them to execute even if the agent is down.
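  A rough sketch of such a notifier script is below. The state path,
  socket path, and message format are assumptions for illustration
  only, and step 2 (managing the metadata proxy) is omitted:

```python
#!/usr/bin/env python3
"""Illustrative keepalived notifier script; not the real neutron code."""
import os
import socket
import sys


def write_state(state_path, router_id, state):
    """Step 1: persist the new VRRP state so it survives agent downtime."""
    conf_dir = os.path.join(state_path, "ha_confs", router_id)
    os.makedirs(conf_dir, exist_ok=True)
    with open(os.path.join(conf_dir, "state"), "w") as f:
        f.write(state)


def notify_agent(sock_path, router_id, state):
    """Step 3: tell the L3 agent about the transition over a unix socket."""
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(sock_path)
        s.sendall(("%s=%s" % (router_id, state)).encode())
        s.close()
    except OSError:
        pass  # the agent may be down; the state file was still written


if __name__ == "__main__":
    # Assumed invocation from keepalived: <script> <router_id> <state>
    router_id, new_state = sys.argv[1], sys.argv[2].lower()
    write_state("/var/lib/neutron", router_id, new_state)
    # Step 2 (start/stop the metadata proxy) would go here.
    notify_agent("/var/lib/neutron/ha_state.sock", router_id, new_state)
```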

  The L3 agent will batch these state change notifications over a period
  of T seconds. When T seconds have passed and no new notifications have
  arrived, it will send an RPC message to the server with a map of
  router ID to VRRP state on that specific agent. Every time the agent
  starts it will perform a full sync from the controller (get all
  routers, configure them, clean up old namespaces), wait for state
  transitions to die down, read the current state of each router and
  send an update to the controller. The RPC message will be retried
  indefinitely in case the management network is temporarily down, or
  the agent is disconnected from it.
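  The batching described above amounts to a debounce: restart a quiet-period
  timer on every notification and flush once it expires. A minimal sketch,
  where send_func stands in for the agent-to-server RPC call (class and
  parameter names are assumptions, not the real agent code):

```python
import threading
import time


class StateChangeBatcher:
    """Batch per-router VRRP state changes; report once no new
    notification has arrived for t_seconds."""

    def __init__(self, send_func, t_seconds=2.0):
        self._send = send_func
        self._t = t_seconds
        self._pending = {}  # router_id -> latest reported state
        self._timer = None
        self._lock = threading.Lock()

    def on_transition(self, router_id, state):
        with self._lock:
            self._pending[router_id] = state
            if self._timer:
                self._timer.cancel()  # restart the quiet-period timer
            self._timer = threading.Timer(self._t, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, {}
        # Retry indefinitely, e.g. while the management network is down.
        while True:
            try:
                self._send(batch)
                return
            except Exception:
                time.sleep(1.0)
```

  Note that only the latest state per router survives the quiet period,
  which is the desired behavior when transitions flap briefly.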

  The server will then persist this information when it receives the
  RPC message: the tables are already set up for this. Each router has
  an entry in the HA bindings table per agent it is scheduled to, and
  the record contains the VRRP state on that specific agent. No DB
  migration will be necessary.
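  On the server side, the handler boils down to updating the existing
  per-agent binding rows with the reported states. A minimal in-memory
  sketch (the class, method name, and dict stand in for the real plugin
  code and HA bindings table):

```python
class L3HAStateReport:
    """Illustrative server-side handler for the agent's state report."""

    def __init__(self):
        # (router_id, agent_host) -> VRRP state; stands in for the
        # existing HA bindings table, which already has a state column.
        self.bindings = {}

    def update_routers_states(self, agent_host, states):
        """Persist the reported router_id -> state map for one agent."""
        for router_id, state in states.items():
            self.bindings[(router_id, agent_host)] = state
```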

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1365453/+subscriptions
