yahoo-eng-team team mailing list archive

[Bug 1365453] Re: State of HA routers not exposed via API

 

Removed Related and Closes bug tags from all patches in the series.

** Changed in: neutron
       Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1365453

Title:
  State of HA routers not exposed via API

Status in OpenStack Neutron (virtual network service):
  Invalid

Bug description:
  It's very difficult to know the state of an HA router on each L3
  agent: which instance is the master, and where it is hosted. This is
  a maintenance nightmare. What if I have a split brain? What if I want
  to know where the master is so I can move it manually? Currently, the
  only way to find out is to SSH into an agent and run:

  cat $state_path/ha_confs/router_id/state

  But this method requires accessing each individual agent. A more
  user-friendly approach would be to expose this via the API, so admins
  could query a router and get a list of agents and the state on each
  agent:

  l3-agent-list-hosting-router <router_id>
  This currently shows all of the agents hosting the requested router.
  It will now also show the HA state on each agent (which agent is the
  master and which are the standbys).

  Implementation choices:
  Keepalived doesn't provide a way to query the current VRRP state. The
  only way to know it, then, is to use notifier scripts. These scripts
  are executed whenever a state transition occurs and receive the new
  state (master, backup, or fault).
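  For illustration, the notifier hookup in a per-router keepalived
  config could look like the following (the interface name and script
  path are hypothetical, not the actual neutron paths); keepalived
  passes the entity type, the instance name, and the new state as
  arguments to the script:

```
vrrp_instance VR_1 {
    state BACKUP
    interface ha-7dfa4a3b
    virtual_router_id 1
    priority 50
    # Executed on every MASTER/BACKUP/FAULT transition; keepalived
    # invokes it as: <script> "INSTANCE" <name> <new_state>
    notify "/usr/bin/neutron-ha-state-change"
}
```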

  Every time we reconfigure keepalived (when the router is created or
  updated) we tell it to execute a Python script (maintained as part of
  the repository). The script will:
  1) Write the new state to a file in $state_path/ha_confs/router_id/state
  2) Start the metadata proxy if the transition was to master, or shut it down if the transition was to backup or fault.
  3) Notify the agent that a transition has occurred via a unix domain socket. Steps 1 and 2 happen in the script, rather than in the agent after it receives the notification, because we want them to execute even if the agent is down.
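  A rough sketch of such a notifier script is below. The state path,
  socket path, and message format are assumptions for illustration
  only, and step 2 (managing the metadata proxy) is omitted:

```python
#!/usr/bin/env python3
"""Illustrative keepalived notifier script; not the real neutron code."""
import os
import socket
import sys


def write_state(state_path, router_id, state):
    """Step 1: persist the new VRRP state so it survives agent downtime."""
    conf_dir = os.path.join(state_path, "ha_confs", router_id)
    os.makedirs(conf_dir, exist_ok=True)
    with open(os.path.join(conf_dir, "state"), "w") as f:
        f.write(state)


def notify_agent(sock_path, router_id, state):
    """Step 3: tell the L3 agent about the transition over a unix socket."""
    try:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(sock_path)
        s.sendall(("%s=%s" % (router_id, state)).encode())
        s.close()
    except OSError:
        pass  # the agent may be down; the state file was still written


if __name__ == "__main__":
    # Assumed invocation from keepalived: <script> <router_id> <state>
    router_id, new_state = sys.argv[1], sys.argv[2].lower()
    write_state("/var/lib/neutron", router_id, new_state)
    # Step 2 (start/stop the metadata proxy) would go here.
    notify_agent("/var/lib/neutron/ha_state.sock", router_id, new_state)
```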

  The L3 agent will batch these state change notifications over a period
  of T seconds. When T seconds have passed and no new notifications have
  arrived, it will send an RPC message to the server with a map of
  router ID to VRRP state on that specific agent. Every time the agent
  starts it will perform a full sync from the controller (get all
  routers, configure them, clean up old namespaces), wait for state
  transitions to die down, read the current state of each router and
  send an update to the controller. The RPC message will be retried
  indefinitely in case the management network is temporarily down, or
  the agent is disconnected from it.
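  The batching described above amounts to a debounce: restart a quiet-period
  timer on every notification and flush once it expires. A minimal sketch,
  where send_func stands in for the agent-to-server RPC call (class and
  parameter names are assumptions, not the real agent code):

```python
import threading
import time


class StateChangeBatcher:
    """Batch per-router VRRP state changes; report once no new
    notification has arrived for t_seconds."""

    def __init__(self, send_func, t_seconds=2.0):
        self._send = send_func
        self._t = t_seconds
        self._pending = {}  # router_id -> latest reported state
        self._timer = None
        self._lock = threading.Lock()

    def on_transition(self, router_id, state):
        with self._lock:
            self._pending[router_id] = state
            if self._timer:
                self._timer.cancel()  # restart the quiet-period timer
            self._timer = threading.Timer(self._t, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, {}
        # Retry indefinitely, e.g. while the management network is down.
        while True:
            try:
                self._send(batch)
                return
            except Exception:
                time.sleep(1.0)
```

  Note that only the latest state per router survives the quiet period,
  which is the desired behavior when transitions flap briefly.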

  The server will then persist this information when it receives the
  RPC message: the tables are already set up for this. Each router has
  an entry in the HA bindings table per agent it is scheduled to, and
  the record contains the VRRP state on that specific agent. No DB
  migration will be necessary.
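  On the server side, the handler boils down to updating the existing
  per-agent binding rows with the reported states. A minimal in-memory
  sketch (the class, method name, and dict stand in for the real plugin
  code and HA bindings table):

```python
class L3HAStateReport:
    """Illustrative server-side handler for the agent's state report."""

    def __init__(self):
        # (router_id, agent_host) -> VRRP state; stands in for the
        # existing HA bindings table, which already has a state column.
        self.bindings = {}

    def update_routers_states(self, agent_host, states):
        """Persist the reported router_id -> state map for one agent."""
        for router_id, state in states.items():
            self.bindings[(router_id, agent_host)] = state
```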

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1365453/+subscriptions
