
yahoo-eng-team team mailing list archive

[Bug 1694505] Re: neutron-ovs-agent dies with return code 0 when neutron-server is down

 

Reviewed:  https://review.openstack.org/469231
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=73701bf75b964509c7d7e8b62dba97f7cbe9c87a
Submitter: Jenkins
Branch:    master

commit 73701bf75b964509c7d7e8b62dba97f7cbe9c87a
Author: Ihar Hrachyshka <ihrachys@xxxxxxxxxx>
Date:   Tue May 30 19:42:16 2017 +0000

    ovs: bubble up failures into main thread in native ofctl mode
    
    When native ofctl interface is used (the default), the agent main() is
    running in a separate gevent thread. Unless we explicitly request from
    ryu to raise errors that may have happened in the agent app, it will
    ignore them (only logging a warning message). This may interfere with
    service management software like systemd that may use the return code to
    decide whether to restart the dead service.
    
    This patch makes ryu raise any uncaught errors happening inside the
    agent. It also makes the agent 'wrapper' helper function not to swallow
    raised exceptions on logging the error. Those two changes combined make
    the agent exit with rc=1 if an exception happens inside the main()
    function when in native mode.
    
    This patch doesn't include any unit tests because those would be very
    silly (like checking that we indeed pass the needed arguments to ryu).
    
    Change-Id: Ic86b5eeae25a916c3c51f21e6820f5b0212dd5f8
    Closes-Bug: #1694505
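
A minimal sketch of the behaviour the commit message describes, not the actual patch. It assumes a plain `threading` worker in place of the ryu-managed green thread, and every function name in it is hypothetical. It contrasts a wrapper that logs and swallows an exception with one that logs and re-raises, and shows how only the second lets the launcher report a non-zero return code.

```python
import logging
import threading

LOG = logging.getLogger(__name__)


def swallowing_wrapper(func):
    """Log the failure and return normally (models the pre-fix behaviour)."""
    def inner():
        try:
            func()
        except Exception:
            LOG.exception("Agent main thread died of an exception")
    return inner


def reraising_wrapper(func):
    """Log the failure, then let it propagate (models the post-fix behaviour)."""
    def inner():
        try:
            func()
        except Exception:
            LOG.exception("Agent main thread died of an exception")
            raise
    return inner


def run_and_get_rc(wrapped):
    """Run `wrapped` in a worker thread and map its outcome to an exit code.

    Stand-in for asking the thread manager to surface worker errors:
    any exception that escapes the wrapper is recorded and becomes rc=1.
    """
    errors = []

    def target():
        try:
            wrapped()
        except Exception as exc:
            errors.append(exc)

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return 1 if errors else 0


def failing_main():
    # Hypothetical agent main() that fails like the RPC timeout in the log.
    raise RuntimeError("Timed out waiting for a reply")
```

With these definitions, `run_and_get_rc(swallowing_wrapper(failing_main))` yields 0 and `run_and_get_rc(reraising_wrapper(failing_main))` yields 1. Both halves are needed: the wrapper must re-raise, and the code that owns the worker must propagate what was raised.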


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1694505

Title:
  neutron-ovs-agent dies with return code 0 when neutron-server is down

Status in neutron:
  Fix Released

Bug description:
  Environment description:

  - Deployment using RDO Trunk repo from master.
  - Neutron based on commit c430e9b

  If neutron-ovs-agent is started before neutron-server starts, it exits
  with return code 0, which systemd does not treat as a failure, so
  the agent is not restarted.

  The following errors appear in /var/log/neutron/openvswitch-agent.log:

  2017-05-30 17:38:48.692 29042 DEBUG neutron.api.rpc.handlers.resources_rpc [req-b5a96471-f0e2-4b24-938c-27ed4d8502c9 - - - - -] neutron.api.rpc.handlers.resources_rpc.ResourcesPullRpcApi method bulk_pull called with arguments (<neutron_lib.context.Context object at 0x75ff950>, 'Port') {} wrapper /usr/lib/python2.7/site-packages/oslo_log/helpers.py:47
  2017-05-30 17:38:49.298 29042 DEBUG ovsdbapp.backend.ovs_idl.vlog [-] [POLLIN] on fd 12 __log_wakeup /usr/lib/python2.7/site-packages/ovs/poller.py:202

  ....
  2017-05-30 17:40:26.506 29042 DEBUG ovsdbapp.backend.ovs_idl.vlog [-] [POLLIN] on fd 12 __log_wakeup /usr/lib/python2.7/site-packages/ovs/poller.py:202
  2017-05-30 17:40:27.530 29042 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-b5a96471-f0e2-4b24-938c-27ed4d8502c9 - - - - -] Agent main thread died of an exception
  ...
  2017-05-30 17:40:27.530 29042 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     'to message ID %s' % msg_id)
  2017-05-30 17:40:27.530 29042 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp MessagingTimeout: Timed out waiting for a reply to message ID 3874905892f543e0be9984e6504644bb
  2017-05-30 17:40:27.530 29042 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
  2017-05-30 17:40:27.624 29042 INFO oslo_rootwrap.client [-] Stopping rootwrap daemon process with pid=29502

  On the systemd side, the following status is reported:

  [root@weirdo1 neutron]# systemctl status neutron-openvswitch-agent
  ● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
     Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: disabled)
     Active: inactive (dead) since Tue 2017-05-30 17:40:27 UTC; 5min ago
   Main PID: 29042 (code=exited, status=0/SUCCESS)

  May 30 17:38:44 weirdo1 systemd[1]: Starting OpenStack Neutron Open vSwitch Agent...
  May 30 17:38:44 weirdo1 neutron-enable-bridge-firewall.sh[29032]: net.bridge.bridge-nf-call-arptables = 1
  May 30 17:38:44 weirdo1 neutron-enable-bridge-firewall.sh[29032]: net.bridge.bridge-nf-call-iptables = 1
  May 30 17:38:44 weirdo1 neutron-enable-bridge-firewall.sh[29032]: net.bridge.bridge-nf-call-ip6tables = 1
  May 30 17:38:44 weirdo1 systemd[1]: Started OpenStack Neutron Open vSwitch Agent.
  May 30 17:38:45 weirdo1 neutron-openvswitch-agent[29042]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be reg...te reports.
  May 30 17:38:46 weirdo1 neutron-openvswitch-agent[29042]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
  May 30 17:38:46 weirdo1 neutron-openvswitch-agent[29042]: Could not load neutron.openstack.common.notifier.rpc_notifier

  
  Note the (code=exited, status=0/SUCCESS)

  
  An easy way to reproduce this is:

  1. Stop neutron-server
  2. Manually start neutron-openvswitch-agent:

  # /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf  --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
  Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.
  Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
  Could not load neutron.openstack.common.notifier.rpc_notifier
  [root@weirdo1 neutron]# echo $?
  0

  Note that the return code is 0.

  
  I'd say this is a bug in the OVS agent, which should exit with rc != 0 so that systemd restarts the service under the current "Restart=on-failure" policy. Otherwise, we should change the systemd restart policy.
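
To illustrate why the exit code matters here, the following is a deliberately simplified model of how a restart policy reacts to a process exit code. It is a sketch, not systemd's implementation: it considers only exit codes (real systemd also looks at signals, timeouts, and the watchdog), it covers only four of the `Restart=` values, and the function name is hypothetical.

```python
def should_restart(policy, exit_code):
    """Return True if a service with this exit code would be restarted.

    Simplified: an exit is "clean" only when exit_code == 0.
    """
    clean = exit_code == 0
    if policy == "always":
        return True
    if policy == "on-failure":
        return not clean
    if policy == "on-success":
        return clean
    if policy == "no":
        return False
    raise ValueError("unsupported policy in this sketch: %r" % (policy,))


# The agent in this report exited with 0 under Restart=on-failure.
agent_restarted = should_restart("on-failure", 0)
```

Under this model, an agent that exits with 0 is never restarted by "on-failure", while one that exits with 1 is. Switching the policy to "always" would restart it in both cases, which is the alternative the report mentions.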

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1694505/+subscriptions

