[Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states
This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2
---------------
neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium
[ Edward Hope-Morley ]
* Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
- d/p/add-check-for-ha-state.patch
[ Chuck Short ]
* d/neutron-common.install, d/neutron-dhcp-agent.install:
Remove cron jobs since they will cause a race when
using an L3 agent. The L3 agent cleans up after itself now.
(LP: #1623664)
-- Chuck Short <zulcss@xxxxxxxxxx> Wed, 19 Apr 2017 11:39:09 +0100
** Changed in: neutron (Ubuntu Xenial)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1648242
Title:
[SRU] Failure to retry update_ha_routers_states
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive mitaka series:
Fix Committed
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Xenial:
Fix Released
Bug description:
[Impact]
Mitigates the risk of an incorrect ha_state being reported by the
l3-agent for HA routers in the case where the RabbitMQ (rmq) connection
is lost during the update window. The fix is already in Ubuntu for
Ocata and Newton, but the upstream backport just missed the Mitaka
point release, hence this SRU.
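For illustration, here is a minimal sketch of the general pattern such a
fix relies on: re-send the HA-state notification when a messaging
timeout occurs instead of dropping it. This is not the actual patch;
plugin_rpc stands in for the agent's RPC handle to the server, and the
retry loop is a simplification:

import time

import oslo_messaging

def notify_ha_states(plugin_rpc, context, states, retries=3, delay=5):
    """Send router ha_state updates, retrying on messaging timeouts."""
    for attempt in range(1, retries + 1):
        try:
            # states e.g. {'<router-id>': 'active'}
            plugin_rpc.update_ha_routers_states(context, states)
            return True
        except oslo_messaging.MessagingTimeout:
            if attempt == retries:
                raise
            time.sleep(delay)  # give the rmq connection time to recover
    return False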
[Test Case]
* deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max
l3-agents-per-router set to 3
* configure a network and router, boot an instance with a floating ip
and start pinging it
* check that 1 agent shows active and the other 2 show standby
* trigger some router failovers while the rabbit server is stopped e.g.
- go to the l3-agent hosting your router and do:
ip netns exec qrouter-${router} ip link set dev <ha-iface> down
check the other units to see if the ha iface has failed over
ip netns exec qrouter-${router} ip link set dev <ha-iface> up
* ensure the ping is still running
* eventually all agents will show as xxx/standby
* start the rabbit server
* wait for the correct ha_state to be set (takes a few seconds; see the
polling sketch after this list)
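A hypothetical helper for that last step, assuming the Mitaka-era
neutron CLI (neutron l3-agent-list-hosting-router, which prints the
agent table quoted later in this report) and OS_* credentials in the
environment; it polls until exactly one agent reports active:

import subprocess
import time

def wait_for_single_active(router_id, timeout=120, interval=5):
    """Poll the l3-agent table until exactly one ha_state is 'active'."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ['neutron', 'l3-agent-list-hosting-router', router_id],
            universal_newlines=True)
        # ha_state is the last populated column of each '|' table row.
        states = [row.split('|')[-2].strip()
                  for row in out.splitlines()
                  if row.startswith('|') and 'ha_state' not in row]
        if states.count('active') == 1:
            return True
        time.sleep(interval)
    return False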
[Regression Potential]
I do not envisage any regression from this patch. One potential
side-effect is mildly increased rmq traffic, but it should be
negligible.
--------
Version: Mitaka
While performing failover testing of L3 HA routers, we discovered an
issue where an agent fails to report its state. In this scenario, we
have a router (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to (3)
L3 agents:
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | :-) | active |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
The infra03 node was shut down completely and abruptly. The router
transitioned to master on infra02 as indicated in these log messages:
2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848
Traffic to/from VMs through the new master router functioned as
expected. However, the ha_state remained 'standby':
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | xxx | standby |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
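To illustrate the inconsistency, a hedged diagnostic sketch: the
l3-agent records the keepalived state for each router locally (under
state_path/ha_confs/<router_id>/state on a default deployment; the path
is an assumption here), so the local view can be compared with the
stale ha_state the API reports:

import os

def local_ha_state(router_id, state_path='/var/lib/neutron'):
    """Return 'master'/'backup' as recorded locally by the l3-agent."""
    state_file = os.path.join(state_path, 'ha_confs', router_id, 'state')
    try:
        with open(state_file) as f:
            return f.read().strip()
    except IOError:
        return 'unknown'

# On infra02 this would have returned 'master' while the API still
# showed 'standby': the lost update_ha_routers_states RPC left the
# server-side view stale.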
A traceback was observed in the logs related to a message timeout,
probably due to the loss of the AMQP connection on infra03:
2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent True)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent return method(context, 'report_state', **kwargs)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=self.retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent timeout=timeout, retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent result = self._waiter.wait(msg_id, timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent message = self.waiters.get(msg_id, timeout=timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent 'to message ID %s' % msg_id)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent
2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec
2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91
2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value.
2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
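The last two log lines show the backoff behaviour of neutron.common.rpc:
on each timeout it waits a randomised interval before retrying and
doubles the per-method timeout (60s to 120s here). A rough sketch of
that pattern using oslo.messaging; the constants and structure are
illustrative, not neutron's actual code:

import random
import time

import oslo_messaging

def call_with_backoff(client, context, method, timeout=60, cap=600, **kw):
    """Retry an RPC call on timeout, doubling the timeout each time."""
    while True:
        try:
            return client.prepare(timeout=timeout).call(context, method,
                                                        **kw)
        except oslo_messaging.MessagingTimeout:
            wait = random.uniform(0, timeout / 2.0)  # e.g. 4s, 35s ...
            timeout = min(timeout * 2, cap)          # 60 -> 120 -> ...
            time.sleep(wait)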
While subsequent transitions for other routers were reported and
Neutron was updated accordingly, this router's ha_state was never
updated after the initial failure. We could force an update by
restarting the L3 agent, or by causing a transition from infra02 to
another node and back. Neither is ideal.
If you need any other information please let me know.
Thanks,
James
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1648242/+subscriptions