yahoo-eng-team team mailing list archive

[Bug 1365453] [NEW] It's very difficult to know what is the state of a HA router on each L3 agent

Public bug reported:

It's very difficult to know the state of an HA router on each L3
agent, which instance is the master, and where it is hosted. This is a
maintenance nightmare. Currently, the only way to find out is to SSH
into an agent and run:

cat $state_path/ha_confs/<router_id>/state

But this method requires accessing each individual agent. A more
user-friendly way would be to expose this via the API, so that admins
could query a router and get the list of agents and the state on each
agent:

router-show <router_id>
would show the list of agents the router is scheduled to, and the state
of the router on each agent.
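
For illustration, the output could include something like the
following (the ha_agent_states field name and format are hypothetical,
just to show the idea):

$ neutron router-show <router_id>
+-----------------+--------------------------------------------+
| Field           | Value                                      |
+-----------------+--------------------------------------------+
| id              | <router_id>                                |
| ha              | True                                       |
| ha_agent_states | {"agent-1": "master", "agent-2": "backup"} |
+-----------------+--------------------------------------------+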

This is harder than it sounds and requires a few design decisions.

Keepalived doesn't provide a way to query the current VRRP state. The
only way to know, then, is to use notifier scripts. These scripts are
executed whenever a state transition occurs and receive the new state
(master, backup, or fault). Every time we reconfigure keepalived (when
the router is created or updated) we write a bash executable with the
router ID and VRRP state baked in; this is the script that we configure
keepalived to execute. The bash script then passes these two parameters
to a Python executable, which forwards the information over a Unix
domain socket to the L3 agent (I expect the Python script to grow).
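
As a rough sketch of that Python executable (the socket path and the
wire format here are hypothetical, not a settled design), it could be
as small as:

import socket
import sys

# Hypothetical path of a per-host socket the L3 agent listens on.
SOCKET_PATH = '/var/lib/neutron/ha_state.sock'

def notify(router_id, state):
    # Forward "router_id,state" (e.g. "abc123,master") to the agent.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(SOCKET_PATH)
        sock.sendall('%s,%s' % (router_id, state))
    finally:
        sock.close()

if __name__ == '__main__':
    notify(sys.argv[1], sys.argv[2])
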
The L3 agent will batch these state change notifications over a period
of T seconds: once T seconds have passed with no new notification
arriving, it will send an RPC message to the server with a map of
router ID to VRRP state on that specific agent. In case the agent
crashes after a notification has been queued but before it's been sent,
we'll also write each notification to disk; when the L3 agent starts,
it will enqueue all of these notifications and remove them from disk.
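
A minimal sketch of that batching logic, assuming plain threads (the
real agent would more likely use eventlet, and would also journal each
notification to disk before enqueuing it, as described above):

import threading

BATCH_WINDOW = 5  # the "T" above; the value is hypothetical

class StateBatcher(object):
    """Collect state changes and flush them after T quiet seconds."""

    def __init__(self, send_rpc):
        self._send_rpc = send_rpc  # callable taking {router_id: state}
        self._pending = {}
        self._timer = None
        self._lock = threading.Lock()

    def enqueue(self, router_id, state):
        with self._lock:
            self._pending[router_id] = state
            # Every notification restarts the quiet-period timer.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(BATCH_WINDOW, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, {}
        if batch:
            self._send_rpc(batch)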

The server will then persist this information when the RPC message
arrives. The tables are already set up for this: each router has an
entry in the HA bindings table per agent it is scheduled to, and the
record contains the VRRP state on that specific agent. The API response
for this router will then also contain a dict of {agent_id: VRRP
state}.
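
The resulting response body could then look something like this (the
ha_vrrp_states attribute name is hypothetical):

{
    "router": {
        "id": "<router_id>",
        "name": "router1",
        "ha": true,
        "ha_vrrp_states": {
            "<agent-id-1>": "master",
            "<agent-id-2>": "backup"
        }
    }
}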

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1365453

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1365453/+subscriptions

