yahoo-eng-team team mailing list archive

[Bug 1365453] [NEW] It's very difficult to know what is the state of a HA router on each L3 agent

Public bug reported:

It's very difficult to know the state of an HA router on each L3
agent, which instance is the master, and where it is hosted. This is a
maintenance nightmare. Currently, the only way to find out is to SSH
into an agent and run:

cat $state_path/ha_confs/<router_id>/state

But this method requires accessing each individual agent. A more
user-friendly way would be to expose this via the API, so that admins
could query a router and get the list of agents and the state on each
agent:

router-show <router_id>
would show the list of agents the router is scheduled to, and the state
of the router on each agent.
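
For illustration, the output could include something like the
following (the ha_agent_states field name and format are hypothetical,
just to show the idea):

$ neutron router-show <router_id>
+-----------------+--------------------------------------------+
| Field           | Value                                      |
+-----------------+--------------------------------------------+
| id              | <router_id>                                |
| ha              | True                                       |
| ha_agent_states | {"agent-1": "master", "agent-2": "backup"} |
+-----------------+--------------------------------------------+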

This is harder than it sounds and requires a few design decisions.

Keepalived doesn't provide a way to query the current VRRP state. The
only way to know, then, is to use notifier scripts. These scripts are
executed whenever a state transition occurs and receive the new state
(master, backup, or fault). Every time we reconfigure keepalived (when
the router is created or updated) we write a bash executable with the
router ID and VRRP state baked in; this is the script that we configure
keepalived to execute. The bash script then passes these two parameters
to a Python executable, which forwards the information over a Unix
domain socket to the L3 agent (I expect the Python script to grow).
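
As a rough sketch of that Python executable (the socket path and the
wire format here are hypothetical, not a settled design), it could be
as small as:

import socket
import sys

# Hypothetical path of a per-host socket the L3 agent listens on.
SOCKET_PATH = '/var/lib/neutron/ha_state.sock'

def notify(router_id, state):
    # Forward "router_id,state" (e.g. "abc123,master") to the agent.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(SOCKET_PATH)
        sock.sendall('%s,%s' % (router_id, state))
    finally:
        sock.close()

if __name__ == '__main__':
    notify(sys.argv[1], sys.argv[2])
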
The L3 agent will batch these state change notifications over a period
of T seconds: once T seconds have passed with no new notification
arriving, it will send an RPC message to the server with a map of
router ID to VRRP state on that specific agent. In case the agent
crashes after a notification has been queued but before it's been sent,
we'll also write each notification to disk; when the L3 agent starts,
it will enqueue all of these notifications and remove them from disk.
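
A minimal sketch of that batching logic, assuming plain threads (the
real agent would more likely use eventlet, and would also journal each
notification to disk before enqueuing it, as described above):

import threading

BATCH_WINDOW = 5  # the "T" above; the value is hypothetical

class StateBatcher(object):
    """Collect state changes and flush them after T quiet seconds."""

    def __init__(self, send_rpc):
        self._send_rpc = send_rpc  # callable taking {router_id: state}
        self._pending = {}
        self._timer = None
        self._lock = threading.Lock()

    def enqueue(self, router_id, state):
        with self._lock:
            self._pending[router_id] = state
            # Every notification restarts the quiet-period timer.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(BATCH_WINDOW, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, {}
        if batch:
            self._send_rpc(batch)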

The server will then persist this information when the RPC message
arrives. The tables are already set up for this: each router has an
entry in the HA bindings table per agent it is scheduled to, and the
record contains the VRRP state on that specific agent. The API response
for this router will then also contain a dict of {agent_id: VRRP
state}.
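
The resulting response body could then look something like this (the
ha_vrrp_states attribute name is hypothetical):

{
    "router": {
        "id": "<router_id>",
        "name": "router1",
        "ha": true,
        "ha_vrrp_states": {
            "<agent-id-1>": "master",
            "<agent-id-2>": "backup"
        }
    }
}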

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1365453

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1365453/+subscriptions

