yahoo-eng-team team mailing list archive

[Bug 1372438] Re: Race condition in l2pop drops tunnels

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Alan Pevec <1372438@xxxxxxxxxxxxxxxxxx>
Date: Thu, 12 Mar 2015 23:21:20 -0000
Reply-to: Bug 1372438 <1372438@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** Also affects: neutron/icehouse
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1372438

Title:
  Race condition in l2pop drops tunnels

Status in OpenStack Neutron (virtual network service):
  Fix Released
Status in neutron icehouse series:
  New
Status in neutron juno series:
  Fix Released

Bug description:
  The issue was originally raised by a Red Hat performance engineer (Joe
  Talerico) here: https://bugzilla.redhat.com/show_bug.cgi?id=1136969
  (see comment 4 onward).

  Joe created a Fedora instance in his OS cloud based on RHEL7-OSP5
  (Icehouse), where he installed the Rally client to run benchmarks
  against that cloud itself. He assigned a floating IP to that instance
  to be able to access API endpoints from inside the Rally machine. Then
  he ran a scenario which basically started up 100+ new instances in
  parallel, tried to access each of them via ssh, and, once that
  succeeded, cleaned up each created instance (with its ports). Once in
  a while, his Rally instance lost its connection to the outside world.
  This was because the VXLAN tunnel to the compute node hosting the
  Rally machine was dropped on the networker node where the DHCP, L3 and
  Metadata agents were running. Once we restarted the OVS agent, the
  tunnel was recreated properly.

  The scenario failed only if the L2POP mechanism driver was enabled.

  I've looked through the OVS agent logs and found that the tunnel was
  dropped due to a legitimate fdb entry removal request coming from the
  neutron-server side. So the fault is probably on the neutron-server
  side, in the l2pop mechanism driver.

  I then looked through the patches in Juno to see whether something
  related to the issue had already been merged, and found the patch that
  gets rid of the _precommit step when cleaning up fdb entries. Once we
  applied the patch on the neutron-server node, we stopped experiencing
  those connectivity failures.

  After discussion with Vivekanandan Narasimhan, we came up with the
  following race condition that may result in tunnels being dropped
  while legitimate resources are still using them:

  (quoting Vivek below)

  '''
  - port1 delete request comes in;
  - port1 delete request acquires the lock;
  - port2 create/update request comes in;
  - port2 create/update waits because the lock is unavailable;
  - precommit phase for port1 determines that the port is the last one, so we should drop the FLOODING_ENTRY;
  - port1 delete is applied to the db;
  - port1 transaction releases the lock;
  - port2 create/update acquires the lock;
  - precommit phase for port2 determines that the port is the first one, so it requests FLOODING_ENTRY + MAC-specific flow creation;
  - port2 create/update request is applied to the db;
  - port2 transaction releases the lock.

  At this point the postcommit of either request could run first, because
  the postcommit code operates outside the locked zone.

  If it happens this way, the tunnel is retained:

  - postcommit phase for port1 requests FLOODING_ENTRY deletion due to port1 deletion;
  - postcommit phase for port2 requests FLOODING_ENTRY + MAC-specific flow creation.

  If it happens the way below, the tunnel breaks:

  - postcommit phase for create port2 requests FLOODING_ENTRY + MAC-specific flow creation;
  - postcommit phase for delete port1 requests FLOODING_ENTRY deletion.
  '''
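
  The two orderings quoted above can be reproduced with a small toy
  model. This is only a sketch under stated assumptions, not the l2pop
  driver code: the function names, the string stand-in for
  FLOODING_ENTRY and the set-based "fdb" are all hypothetical. Each
  precommit decides under the (implicit) lock what to send, and each
  postcommit applies that stale decision later, in either order.

```python
# Toy model of the race; all names here are hypothetical, not Neutron APIs.
FLOODING_ENTRY = "FLOODING_ENTRY"  # stand-in for the real flooding entry


def precommit_delete(db_ports, port):
    """Decide, under the lock, what to remove when `port` is deleted."""
    to_remove = {port}
    if not (db_ports - {port}):      # port is the last one on the agent
        to_remove.add(FLOODING_ENTRY)
    db_ports.discard(port)           # delete applied to the db
    return ("remove", to_remove)


def precommit_create(db_ports, port):
    """Decide, under the lock, what to add when `port` is created."""
    to_add = {port}
    if not db_ports:                 # port is the first one on the agent
        to_add.add(FLOODING_ENTRY)
    db_ports.add(port)               # create applied to the db
    return ("add", to_add)


def postcommit(fdb, decision):
    """Apply a decision taken earlier; runs outside the lock."""
    op, entries = decision
    if op == "add":
        fdb |= entries
    else:
        fdb -= entries


def run(postcommit_order):
    db_ports = {"port1"}
    fdb = {FLOODING_ENTRY, "port1"}  # what the remote agent has programmed
    # Precommits are serialized by the lock: delete port1, then create port2.
    decisions = {}
    decisions["delete"] = precommit_delete(db_ports, "port1")
    decisions["create"] = precommit_create(db_ports, "port2")
    # Postcommits are not serialized, so either order is possible.
    for name in postcommit_order:
        postcommit(fdb, decisions[name])
    return fdb


retained = run(("delete", "create"))
broken = run(("create", "delete"))
print("delete then create keeps flooding entry:", FLOODING_ENTRY in retained)
print("create then delete keeps flooding entry:", FLOODING_ENTRY in broken)
```

  In this model the second ordering leaves port2 programmed without the
  flooding entry, which corresponds to the dropped tunnel described in
  the quote.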

  We considered backporting to Icehouse the patch that gets rid of
  precommit [1], which seems to eliminate the race. However, we're
  concerned that Juno went back to the previous behaviour as part of the
  DVR work [2]. We haven't done any testing to check whether the issue
  is present in Juno, though a brief analysis of the code suggests that
  it should fail there too.
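
  As a rough illustration of why dropping the precommit step might
  help, here is a toy sketch. It is not the content of patch [1]: it
  simply assumes that each postcommit derives its fdb message from the
  db state it sees at that moment, and that both transactions have
  already committed. All names are hypothetical, and other
  interleavings are not modelled.

```python
# Toy model only; assumes decisions are computed at postcommit time.
FLOODING_ENTRY = "FLOODING_ENTRY"  # hypothetical stand-in


def postcommit_delete(db_ports, fdb, port):
    """Build the removal from the current db state, then apply it."""
    to_remove = {port}
    if not db_ports:                 # no ports left right now
        to_remove.add(FLOODING_ENTRY)
    fdb -= to_remove


def postcommit_create(db_ports, fdb, port):
    """Build the addition from the current db state, then apply it."""
    to_add = {port}
    if db_ports == {port}:           # port is the only one right now
        to_add.add(FLOODING_ENTRY)
    fdb |= to_add


def run(postcommit_order):
    db_ports = {"port2"}             # port1 deleted, port2 created
    fdb = {FLOODING_ENTRY, "port1"}  # what the remote agent has programmed
    steps = {
        "delete": lambda: postcommit_delete(db_ports, fdb, "port1"),
        "create": lambda: postcommit_create(db_ports, fdb, "port2"),
    }
    for name in postcommit_order:
        steps[name]()
    return fdb


print(sorted(run(("delete", "create"))))
print(sorted(run(("create", "delete"))))
```

  Under these assumptions both orderings end with the flooding entry
  and port2 present, because the delete no longer acts on a decision
  taken before port2 existed.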

  Ideally, the fix for Juno should be easily backportable because the
  issue is currently present in Icehouse, and we would like to have the
  same fix for both branches (Icehouse and Juno) instead of backporting
  patch [1] to Icehouse and implementing another patch for Juno.

  [1]: https://review.openstack.org/#/c/95165/
  [2]: https://review.openstack.org/#/c/102398/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1372438/+subscriptions
References

[Bug 1372438] [NEW] Race condition in l2pop drops tunnels
From: Ihar Hrachyshka, 2014-09-22