yahoo-eng-team team mailing list archive

[Bug 1836023] [NEW] OVS agent "hangs" while processing trusted ports

 

Public bug reported:

Queens, ovsdb native interface.

On a loaded gateway node hosting > 1000 ports, restarting
neutron-openvswitch-agent causes the agent, at some point, to stop
sending state reports and stop logging for a significant time, depending
on the number of ports. In our case the gateway node hosts > 1400 ports
and the agent hangs for ~100 seconds. Thus, if the configured
agent_down_time is less than 100 seconds, the neutron server sees the
agent as down and starts rescheduling resources. After the agent stops
hanging, it sees itself as "revived" and starts a new full sync. This
loop is almost endless.
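
The timing relationship described above can be sketched as follows. This
is only an illustration: is_agent_down is a hypothetical helper, not
neutron's actual code, and the timestamps and thresholds are example
values.

```python
from datetime import datetime, timedelta


def is_agent_down(last_heartbeat, now, agent_down_time):
    """Hypothetical check: the agent counts as down when its last state
    report is older than agent_down_time seconds."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


# Example values only: the agent was silent for ~100 seconds.
now = datetime(2019, 7, 10, 12, 0, 0)
last_report = now - timedelta(seconds=100)

down_with_short_timeout = is_agent_down(last_report, now, 75)
down_with_long_timeout = is_agent_down(last_report, now, 120)
```

With a threshold below the hang duration the agent is reported as down;
with a threshold above it, the same silence is tolerated.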

Debugging showed the culprit is process_trusted_ports:
https://github.com/openstack/neutron/blob/13.0.4/neutron/agent/linux/openvswitch_firewall/firewall.py#L655
This function does not yield control to other greenthreads and blocks
until all trusted ports are processed. Since almost all ports on gateway
nodes are "trusted" (router and DHCP ports), process_trusted_ports may
take a significant amount of time.

The proposal is to add greenlet.sleep(0) inside the loop in
process_trusted_ports; that fixed the issue in our environment.
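
The effect of yielding inside the loop can be illustrated with a minimal
sketch. It uses asyncio from the Python standard library as a stand-in
for eventlet greenthreads (where the equivalent yield is usually written
eventlet.sleep(0)); it is not the neutron code. The names
process_one_port, process_ports, heartbeat and count_reports are
hypothetical.

```python
import asyncio


def process_one_port(port_id):
    # Hypothetical stand-in for the per-port firewall work.
    return sum(range(1000)) + port_id


async def process_ports(port_ids, yield_each_iteration):
    for port_id in port_ids:
        process_one_port(port_id)
        if yield_each_iteration:
            # Zero-length sleep: give other tasks a turn, then continue.
            await asyncio.sleep(0)


async def heartbeat(reports, stop):
    # Stand-in for the agent's periodic state report.
    while not stop.is_set():
        reports.append(len(reports))
        await asyncio.sleep(0)


async def _run(yield_each_iteration, n_ports):
    reports = []
    stop = asyncio.Event()
    hb_task = asyncio.create_task(heartbeat(reports, stop))
    await process_ports(range(n_ports), yield_each_iteration)
    stop.set()
    await hb_task
    return len(reports)


def count_reports(yield_each_iteration, n_ports=50):
    """Return how many heartbeats ran while the ports were processed."""
    return asyncio.run(_run(yield_each_iteration, n_ports))


if __name__ == "__main__":
    print("without yield:", count_reports(False))
    print("with yield:", count_reports(True))
```

Without the yield, the heartbeat task never gets scheduled until the
whole loop has finished, which mirrors the missing state reports. With
the yield, the heartbeat runs between ports.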

** Affects: neutron
     Importance: High
     Assignee: Oleg Bondarev (obondarev)
         Status: In Progress


** Tags: ovs-fw

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1836023


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1836023/+subscriptions

