
yahoo-eng-team team mailing list archive

[Bug 1390620] [NEW] Race condition between destroy instance and ovs_neutron_agent


Public bug reported:

There's a race condition between the time compute deletes a neutron port
during destroy instance processing and the time the ovs_neutron_agent
rpc_loop notices the OVS port has been removed and tries to update the
associated neutron port to set its state to DOWN.

In our scenario, the controller node is separate from the compute host, so
these calls go over REST or RPC.

It appears that normally the ovs_neutron_agent wins and the RPC call to
update the neutron port happens before compute makes the REST API delete
call.  However, once in a while, compute's delete gets in first and removes
the port before the OVS agent tries to update it.  In that case the update
fails, the failure is reported back via RPC, and the OVS agent then does a
full resync on the next iteration of rpc_loop.
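
To make the failure path concrete, here is a rough paraphrase of the
agent-side flow.  This is not the actual neutron source; the standalone
helper signature and variable names are mine, and only update_device_down
corresponds to the agent's plugin RPC call:

    import logging

    LOG = logging.getLogger(__name__)

    def treat_devices_removed(plugin_rpc, context, agent_id, host, devices):
        """Rough paraphrase of today's behaviour: any failure while marking
        a removed port DOWN makes the agent request a full resync."""
        resync = False
        for device in devices:
            try:
                # Tell the server the port is DOWN.  If compute's REST
                # delete has already removed the neutron port, this RPC
                # fails on the server side...
                plugin_rpc.update_device_down(context, device, agent_id, host)
            except Exception:
                LOG.exception("update_device_down failed for %s", device)
                # ...and the failure is folded into the resync decision, so
                # rpc_loop reprocesses every port on its next iteration.
                resync = True
        return resync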

In a large-scale environment this is a problem because that resync can take
a very long time due to the very large number of ports that must be
reprocessed.  While that single iteration is running (I have seen it take
10 minutes), new deploys start failing because the VIF plug event times
out: the agent will not process the port created for the plug until the
next iteration, which could be 10 minutes away, by which point the deploy
has already failed and cleaned up the port.

I think the fix for this, which I will create a patch for, is to stop
making treat_devices_removed part of the decision to resync.  If we're
removing a port anyway, why do we care that the agent failed to find the
neutron port?  Not sure if there are other considerations to think about.
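
Continuing the sketch above, the proposed change would be along these lines
(hypothetical, only to illustrate the idea; the actual patch may differ):

    def treat_devices_removed(plugin_rpc, context, agent_id, host, devices):
        """Proposed behaviour: failures while marking removed ports DOWN
        are logged but never trigger a full resync."""
        for device in devices:
            try:
                plugin_rpc.update_device_down(context, device, agent_id, host)
            except Exception:
                # The port is going away regardless of whether the server
                # still knows about it, so a missing neutron port is not
                # worth reprocessing every other port on the node.
                LOG.debug("update_device_down failed for %s; ignoring.",
                          device)
        return False  # rpc_loop no longer resyncs because of removals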

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1390620

Title:
  Race condition between destroy instance and ovs_neutron_agent

Status in OpenStack Neutron (virtual network service):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1390620/+subscriptions

