yahoo-eng-team team mailing list archive

[Bug 1837635] [NEW] HA router state change from "standby" to "master" should be delayed

 

Public bug reported:

Currently, when an HA state change occurs, the agent executes a series
of actions [1]: it updates the metadata proxy, updates the prefix
delegation, executes the L3 extension "ha_state_change" methods, updates
the radvd status and notifies the server of the change.

When a switch-over happens in a system with more than two routers (one
in "active" mode and the others in "standby"), the "keepalived"
process [2] in each "standby" server will set the virtual IP on the HA
interface and advertise it. If another router HA interface has the same
priority (by default in Neutron, the HA instances of the same router ID
all have the same priority, 50) but a higher IP [3], the HA interface of
this instance will have its VIPs and routes deleted and will become
"standby" again. E.g.: [4]
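
The tie-break described above can be illustrated with a minimal sketch.
It is not keepalived code: the HaInstance class, the elect_master
function, the host names and the IP addresses are all hypothetical. It
only assumes the rule stated in this report, that the higher priority
wins and that equal priorities are resolved by the higher IP.

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class HaInstance:
    """Hypothetical model of one HA router instance (not a Neutron class)."""
    host: str
    priority: int
    ha_ip: str


def elect_master(instances):
    """Return the instance that should end up as master.

    Higher priority wins; with equal priorities the higher IP wins.
    ip_address() makes the comparison numeric instead of lexical.
    """
    return max(instances, key=lambda i: (i.priority, ip_address(i.ha_ip)))


# Illustrative values only: both standby instances share priority 50.
standbys = [
    HaInstance("controller-0", 50, "169.254.192.5"),
    HaInstance("controller-2", 50, "169.254.192.11"),
]
winner = elect_master(standbys)
print(winner.host)
```

With these made-up addresses both instances would first announce
themselves as master, and the one on controller-0 would then fall back
to "standby" because its IP is lower.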

In some cases, we have detected that when the master controller is
rebooted, the change from "standby" to "master" of the other two servers
is detected, but the subsequent change from "master" to "standby" of the
server with the lower IP (as described above) is not registered by the
server, because the Neutron server is still not accessible (the master
controller was rebooted). This status change is sometimes lost. The
result is that both "standby" servers are reported as "master", because
the "master"-to-"standby" transition of one of them was lost.

1) INITIAL STATUS
(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | standby  |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | standby  |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | active   |
+--------------------------------------+--------------------------+----------------+-------+----------+

2) CONTROLLER 1 REBOOTED
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 4056cd8e-e062-4f45-bc83-d3eb51905ff5 | controller-0.localdomain | True           | :-)   | active   |
| 527d6a6c-8d2e-4796-bbd0-8b41cf365743 | controller-2.localdomain | True           | :-)   | active   |
| edbdfc1c-3505-4891-8d00-f3a6308bb1de | controller-1.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+


The aim of this bug is to make this problem public and to propose a patch that delays the transition from "standby" to "master", so that keepalived, among all the instances running in the HA servers, can decide which one of them is the "master" server.
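
One possible shape for such a delay is sketched below. It is not the
proposed patch: DelayedMasterReporter, its report callback and the delay
value are hypothetical names chosen for illustration. The idea is to
hold back a "master" report for a short time and drop it if the same
router changes state again before the delay expires.

```python
import threading
import time


class DelayedMasterReporter:
    """Hypothetical helper: hold back "master" reports for `delay` seconds."""

    def __init__(self, report, delay):
        self._report = report      # callable(router_id, state)
        self._delay = delay
        self._lock = threading.Lock()
        self._generation = {}      # router_id -> number of the latest transition

    def on_state_change(self, router_id, state):
        with self._lock:
            gen = self._generation.get(router_id, 0) + 1
            self._generation[router_id] = gen
        if state == "master":
            # Report later, and only if no newer transition has arrived.
            timer = threading.Timer(
                self._delay, self._fire, args=(router_id, state, gen))
            timer.daemon = True
            timer.start()
        else:
            # Transitions to "standby" are reported immediately.
            self._report(router_id, state)

    def _fire(self, router_id, state, gen):
        with self._lock:
            superseded = self._generation.get(router_id) != gen
        if not superseded:
            self._report(router_id, state)


reported = []
reporter = DelayedMasterReporter(
    lambda rid, st: reported.append((rid, st)), delay=0.1)
reporter.on_state_change("router-a", "master")   # loses the election...
reporter.on_state_change("router-a", "standby")  # ...and falls back at once
reporter.on_state_change("router-b", "master")   # stays master
time.sleep(0.5)
print(reported)
```

In this sketch the short-lived "master" state of router-a never reaches
the callback, so there is no later "standby" report that could be lost.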


[1] https://github.com/openstack/neutron/blob/stable/stein/neutron/agent/l3/ha.py#L115-L134
[2] https://www.keepalived.org/
[3] keepalived uses this method to decide which router takes precedence and must be master.
[4] http://paste.openstack.org/show/754760/

** Affects: neutron
     Importance: Undecided
     Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
         Status: New

** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1837635

Title:
  HA router state change from "standby" to "master" should be delayed

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1837635/+subscriptions

