[Bug 1916024] [NEW] HA router master instance in error state because qg-xx interface is down

 

Public bug reported:

BZ reference: https://bugzilla.redhat.com/show_bug.cgi?id=1929829

Sometimes a router is created with all of its instances in standby mode
because the qg-xx interface is down, so the router has no
connectivity:

(overcloud) [stack@undercloud-0 ~]$ neutron l3-agent-list-hosting-router router1
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+---------------------------+----------------+-------+----------+
| id                                   | host                      | admin_state_up | alive | ha_state |
+--------------------------------------+---------------------------+----------------+-------+----------+
| 3b93ec23-48fa-4847-bbb2-f8903e9865f9 | networker-1.redhat.local  | True           | :-)   | standby  |
| 41b8d1a8-4695-445a-916a-d12db523eb91 | controller-0.redhat.local | True           | :-)   | standby  |
| 4533bd88-d2d1-4320-9e39-6fcb2a5cc236 | networker-0.redhat.local  | True           | :-)   | standby  |
+--------------------------------------+---------------------------+----------------+-------+----------+
(overcloud) [stack@undercloud-0 ~]$ 
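
The condition shown above (no hosting agent reporting an active instance)
can be detected by parsing the CLI table. This is only a minimal sketch:
the helper names ha_states and has_no_active_instance are hypothetical,
and it assumes a healthy router reports the value "active" in the
ha_state column for exactly one agent.

```python
def ha_states(table_text):
    """Return {host: ha_state} parsed from the CLI's ASCII table."""
    states = {}
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts and deprecation warnings
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first row of cells is the column header
            continue
        row = dict(zip(header, cells))
        states[row["host"]] = row["ha_state"]
    return states


def has_no_active_instance(table_text):
    """True when the router has hosting agents but none reports 'active'."""
    states = ha_states(table_text)
    return bool(states) and "active" not in states.values()
```

Feeding the captured output of the command above to
has_no_active_instance() flags the router reported in this bug, since
all three agents are in standby.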


Steps to reproduce:
1. for i in $(seq 10); do ./create.sh $i; done
2. Check FIP connectivity to detect the error
3. for i in $(seq 10); do ./delete.sh $i; done

Scripts: http://paste.openstack.org/show/802777/
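
The connectivity check in the steps above could be automated along these
lines. This is a hedged sketch, not part of the linked scripts: the
function names are hypothetical, the floating IPs must be supplied by the
caller, and the -c/-W options assume the Linux iputils ping.

```python
import subprocess


def ping_once(address, timeout_s=2):
    """Send one ICMP echo with the system `ping`; True if it was answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_fips(fips, ping=ping_once):
    """Return the floating IPs that did not answer, preserving input order."""
    return [fip for fip in fips if not ping(fip)]
```

Any address returned by unreachable_fips() points at a router worth
inspecting with l3-agent-list-hosting-router. The ping argument is
injectable so the logic can be exercised without network access.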

This seems to be a race condition between the L3 agent and keepalived when configuring the qg-xxx interface:
- /var/log/messages: http://paste.openstack.org/show/802778/
- L3 agent logs: http://paste.openstack.org/show/802779/

When keepalived is setting the qg-xxx interface IP addresses, the
interface disappears from udev and then reappears (I don't know why
yet). The journalctl log looks the same as when a new interface is
created.

Since [1], the L3 agent controls the GW interface status (up or down).
If the L3 agent does not bring the interface up, the router namespace
won't be able to send or receive any traffic.

[1] https://review.opendev.org/q/I8dca2c1a2f8cb467cfb44420f0eea54ca0932b05
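
To confirm on an affected node that the gateway device was left down, the
link flags can be read inside the router namespace. The sketch below is
an illustration only, not the L3 agent's own code: the function names
are hypothetical, the namespace is assumed to follow the usual
qrouter-<router id> naming, and the device name has to be taken from the
node.

```python
import re
import subprocess

_FLAGS_RE = re.compile(r"<([^>]*)>")


def link_is_admin_up(ip_link_line):
    """True if the flags of an `ip -o link show` line include UP."""
    match = _FLAGS_RE.search(ip_link_line)
    if match is None:
        return False
    # Compare whole flags so that LOWER_UP is not mistaken for UP.
    return "UP" in match.group(1).split(",")


def gateway_is_up(namespace, device):
    """Run `ip` inside the router namespace (needs root) and check the device."""
    out = subprocess.run(
        ["ip", "-n", namespace, "-o", "link", "show", "dev", device],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return link_is_admin_up(out)
```

A False result from gateway_is_up() for the qg-xxx device on every
hosting node would match the failure described here.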

** Affects: neutron
     Importance: Undecided
     Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
         Status: New


** Tags: l3-ha

** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1916024

Title:
  HA router master instance in error state because qg-xx interface is
  down

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1916024/+subscriptions

