
yahoo-eng-team team mailing list archive

[Bug 1823314] [NEW] ha router sometime goes in standby mode in all controllers

 

Public bug reported:

Sometimes, when 2 HA routers are created for the same tenant within a
very short time, both routers may be assigned the same vr_id. To
keepalived they then look like the same application, so only one of
those routers will be active on some hosts.

When I spotted it, it looked like this:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-2
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+
[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-1
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+


And in the DB it looks like this:

MariaDB [ovs_neutron]> select * from router_extra_attributes;
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| router_id                            | distributed | service_router | ha | ha_vr_id | availability_zone_hints |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
| 6ba430d7-2f9d-4e8e-a59f-4d4fb5644a8e |           0 |              0 |  1 |        1 | []                      |
| ace64e85-5f3b-4815-aeae-3b54c75ef5eb |           0 |              0 |  1 |        1 | []                      |
| cd6b61e1-60c9-47da-8866-169ca29ece20 |           1 |              0 |  0 |        0 | []                      |
+--------------------------------------+-------------+----------------+----+----------+-------------------------+
3 rows in set (0.01 sec)

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
+--------------------------------------+-------+
1 row in set (0.01 sec)
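
The mismatch above (two HA routers with ha_vr_id 1, but only one allocation row) can be checked for mechanically. The following is a minimal sketch, not part of Neutron: find_vrid_collisions is a hypothetical helper, and it assumes all rows passed in belong to the same tenant's HA network, since the same vr_id on different HA networks is not a conflict. The sample rows are copied from the router_extra_attributes output above.

```python
from collections import defaultdict


def find_vrid_collisions(rows):
    """Return {vr_id: [router_id, ...]} for vr_ids shared by several HA routers.

    rows: iterable of (router_id, ha, ha_vr_id) tuples, all assumed to be
    on the same HA network (hypothetical simplification).
    """
    by_vrid = defaultdict(list)
    for router_id, ha, ha_vr_id in rows:
        if ha:  # non-HA routers show ha_vr_id 0 above and are skipped
            by_vrid[ha_vr_id].append(router_id)
    return {vrid: ids for vrid, ids in by_vrid.items() if len(ids) > 1}


# Rows taken from the router_extra_attributes table in this report.
rows = [
    ("6ba430d7-2f9d-4e8e-a59f-4d4fb5644a8e", 1, 1),
    ("ace64e85-5f3b-4815-aeae-3b54c75ef5eb", 1, 1),
    ("cd6b61e1-60c9-47da-8866-169ca29ece20", 0, 0),
]
collisions = find_vrid_collisions(rows)
print(collisions)
```

For the data in this report, the result contains a single entry for vr_id 1 listing both HA routers.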

So there is indeed a possible race when 2 different routers are created
within a very short time.
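
To illustrate one way such a race can arise, here is a small sketch of a check-then-insert allocation, followed by a variant guarded by a unique constraint with retry. This is an assumption about the general pattern, not the actual code path in l3_hamode_db.py. The names pick_free_vrid and allocate_with_constraint, the "allocations" table, and the 1..255 range (the VRRP VRID range) are illustrative only.

```python
import sqlite3

VRID_FIRST, VRID_LAST = 1, 255  # illustrative range, not taken from Neutron


def pick_free_vrid(used):
    """Return the lowest vr_id that is not in the given set."""
    for vrid in range(VRID_FIRST, VRID_LAST + 1):
        if vrid not in used:
            return vrid
    raise RuntimeError("no free vr_id")


# Racy pattern: both requests read the allocations before either writes.
allocated = set()
snapshot_a = set(allocated)  # request A reads: nothing allocated yet
snapshot_b = set(allocated)  # request B reads: still nothing allocated
vrid_a = pick_free_vrid(snapshot_a)
vrid_b = pick_free_vrid(snapshot_b)
allocated.add(vrid_a)
allocated.add(vrid_b)  # same value, so only one allocation is recorded
print("racy:", vrid_a, vrid_b, allocated)


def allocate_with_constraint(conn, network_id):
    """Allocate a vr_id, relying on the primary key to reject duplicates."""
    while True:
        used = {
            row[0]
            for row in conn.execute(
                "SELECT vr_id FROM allocations WHERE network_id = ?",
                (network_id,),
            )
        }
        vrid = pick_free_vrid(used)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO allocations VALUES (?, ?)", (network_id, vrid)
                )
            return vrid
        except sqlite3.IntegrityError:
            continue  # another request took this vr_id; re-read and retry


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allocations ("
    "network_id TEXT, vr_id INTEGER, PRIMARY KEY (network_id, vr_id))"
)
safe_a = allocate_with_constraint(conn, "45aaae94-ce16-412d-bd74-b3812b16ff6f")
safe_b = allocate_with_constraint(conn, "45aaae94-ce16-412d-bd74-b3812b16ff6f")
print("guarded:", safe_a, safe_b)
```

In the racy half both requests end up with the same vr_id and only one allocation is recorded, which resembles the state shown in the tables above. In the guarded half the insert itself is the arbiter, so a concurrent duplicate would fail and be retried instead of being silently accepted.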

But when I then created another router, it was created properly with a
new vr_id and everything worked fine for it:

[stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router-3
+--------------------------------------+--------------------------+----------------+-------+----------+
| id                                   | host                     | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------+----------------+-------+----------+
| 0d654b7c-da42-4847-a24f-6d1df804ca3b | controller-1.localdomain | True           | :-)   | standby  |
| 242e1e81-7e4e-466e-8354-a9c46982ff88 | controller-0.localdomain | True           | :-)   | active   |
| 3d241b02-031a-4623-a179-88e1953b3889 | controller-2.localdomain | True           | :-)   | standby  |
+--------------------------------------+--------------------------+----------------+-------+----------+

MariaDB [ovs_neutron]> select * from ha_router_vrid_allocations;
+--------------------------------------+-------+
| network_id                           | vr_id |
+--------------------------------------+-------+
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     1 |
| 45aaae94-ce16-412d-bd74-b3812b16ff6f |     2 |
+--------------------------------------+-------+


I found this bug on an old version based on the Newton release, but from what I saw in https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L109 this code hasn't changed much, so I think the same issue may also happen on newer releases.

** Affects: neutron
     Importance: Undecided
     Assignee: Slawek Kaplonski (slaweq)
         Status: New


** Tags: l3-dvr-backlog

** Changed in: neutron
     Assignee: (unassigned) => Slawek Kaplonski (slaweq)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1823314

Title:
  ha router sometime goes in standby mode in all controllers

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1823314/+subscriptions

