[Bug 1372438] [NEW] Race condition in l2pop drops tunnels

Public bug reported:

The issue was originally raised by a Red Hat performance engineer (Joe
Talerico) here: https://bugzilla.redhat.com/show_bug.cgi?id=1136969
(see comment 4 and onwards).

Joe created a Fedora instance in his OpenStack cloud based on RHEL7-OSP5
(Icehouse) and installed the Rally client in it to run benchmarks
against that same cloud. He assigned a floating IP to the instance to be
able to reach the API endpoints from inside the Rally machine. Then he
ran a scenario which started 100+ new instances in parallel, tried to
access each of them via ssh, and, once that succeeded, cleaned up each
created instance (with its ports). Once in a while, his Rally instance
lost connectivity to the outside world. This happened because the VXLAN
tunnel to the compute node hosting the Rally machine was dropped on the
networker node where the DHCP, L3 and Metadata agents were running. Once
we restarted the OVS agent, the tunnel was recreated properly.

The scenario failed only if the l2pop mechanism driver was enabled.
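
For reference, l2pop is enabled by adding the l2population mechanism
driver on the neutron-server side and turning it on in the agents;
roughly as follows (exact option names may differ slightly between
releases):

'''
# ml2_conf.ini on the neutron-server node
[ml2]
tenant_network_types = vxlan
mechanism_drivers = openvswitch,l2population

# on each node running the OVS agent
[agent]
tunnel_types = vxlan
l2_population = True
'''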

I've looked through the OVS agent logs and found out that the tunnel was
dropped due to a legitimate fdb entry removal request coming from the
neutron-server side. So the fault is probably on the neutron-server
side, in the l2pop mechanism driver.
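
To make the mechanics concrete: l2pop distributes fdb entries keyed by
network and by the tunnel IP of each agent, and a special flooding entry
stands for the tunnel itself. The sketch below approximates the payload
shape (reconstructed for illustration, not copied from the Icehouse
code); when an agent is told to remove the flooding entry for a remote
node and no other ports need it, it tears down the tunnel to that node:

'''
# Approximate shape of an l2pop "remove_fdb_entries" payload
# (illustrative; the exact layout varies between releases).
FLOODING_ENTRY = ('00:00:00:00:00:00', '0.0.0.0')

remove_fdb_entries = {
    'net-uuid': {
        'network_type': 'vxlan',
        'segment_id': 1001,                         # VNI
        'ports': {
            '192.0.2.10': [                         # tunnel IP of remote node
                FLOODING_ENTRY,                     # drops flood flows/tunnel
                ('fa:16:3e:aa:bb:cc', '10.0.0.5'),  # MAC-specific flow
            ],
        },
    },
}
'''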

I then looked through the patches in Juno to see whether anything
related to the issue had already been merged, and found the patch that
gets rid of the _precommit step when cleaning up fdb entries. Once we
applied that patch on the neutron-server node, we stopped experiencing
those connectivity failures.

After discussion with Vivekanandan Narasimhan, we came up with the
following race condition that may result in tunnels being dropped while
legitimate resources are still using them:

(quoting Vivek below)

'''
- port1 delete request comes in;
- port1 delete request acquires the lock;
- port2 create/update request comes in;
- port2 create/update waits due to unavailability of the lock;
- precommit phase for port1 determines that the port is the last one, so we should drop the FLOODING_ENTRY;
- port1 delete applied to db;
- port1 transaction releases the lock;
- port2 create/update acquires the lock;
- precommit phase for port2 determines that the port is the first one, so it requests FLOODING_ENTRY + MAC-specific flow creation;
- port2 create/update request applied to db;
- port2 transaction releases the lock.

At this point the postcommit of either of them could happen first,
because these code pieces operate outside the locked zone.

If it happens this way, the tunnel is retained:

- postcommit phase for port1 requests FLOODING_ENTRY deletion due to port1 deletion;
- postcommit phase requests FLOODING_ENTRY + MAC-specific flow creation for port2.

If it happens the following way, the tunnel breaks:

- postcommit phase for create port2 requests FLOODING_ENTRY + MAC-specific flow creation;
- postcommit phase for delete port1 requests FLOODING_ENTRY deletion.
'''
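
The ordering dependency is easy to see in a minimal, self-contained
Python sketch (illustrative only, not the actual driver code): the
first/last-port decision is made under the lock, but the fdb
notification that acts on it is sent after the lock is released, so the
two notifications can be applied in the opposite order:

'''
import threading

lock = threading.Lock()
ports_in_db = {"port1"}   # port1 is currently the only port on the node
tunnel_exists = True      # state of the VXLAN tunnel on the agent side

def delete_port1_precommit():
    with lock:
        last_port = (len(ports_in_db) == 1)   # decision made under lock
        ports_in_db.discard("port1")          # db change committed
    return last_port

def create_port2_precommit():
    with lock:
        first_port = (len(ports_in_db) == 0)  # decision made under lock
        ports_in_db.add("port2")
    return first_port

def delete_port1_postcommit(last_port):
    global tunnel_exists
    if last_port:
        tunnel_exists = False                 # FLOODING_ENTRY deletion

def create_port2_postcommit(first_port):
    global tunnel_exists
    if first_port:
        tunnel_exists = True                  # FLOODING_ENTRY creation

# The losing interleaving: both precommits run in order, serialized by
# the lock, but the postcommits then run in the opposite order.
last = delete_port1_precommit()    # sees port1 as the last port
first = create_port2_precommit()   # sees port2 as the first port
create_port2_postcommit(first)     # tunnel (re)created ...
delete_port1_postcommit(last)      # ... then torn down again
assert tunnel_exists is False      # port2 is left without a tunnel
'''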

We considered backporting to Icehouse the patch that gets rid of the
precommit step [1], since it seems to eliminate the race, but we're
concerned that this change was reverted to the previous behaviour in
Juno as part of the DVR work [2]. We haven't done any testing to check
whether the issue is present in Juno, though a brief analysis of the
code suggests that it should fail there too.

Ideally, the fix for Juno should be easily backportable because the
issue is currently present in Icehouse, and we would like to have the
same fix for both branches (Icehouse and Juno) instead of backporting
patch [1] to Icehouse and implementing another patch for Juno.
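
For comparison, the direction patch [1] takes can be sketched as follows
(a hypothetical simplification, not the actual diff; the helper names
are made up for illustration): drop the precommit bookkeeping and decide
whether to remove the flooding entry at postcommit time, from the db
state as it is after the deletion has been committed:

'''
# Hypothetical sketch of deciding at postcommit time (helper names are
# invented for illustration; this is not the actual change in [1]).
def delete_port_postcommit(context):
    agent_ip = get_tunnel_ip(context.host)
    # Re-check the db *after* the delete has been committed, so the
    # answer cannot be invalidated by a concurrent create/update that
    # was waiting on the lock.
    remaining = count_active_ports_on_host(context.network_id,
                                           context.host)
    notify_remove_mac_entry(context.network_id, context.port)
    if remaining == 0:
        notify_remove_flooding_entry(context.network_id, agent_ip)
'''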

[1]: https://review.openstack.org/#/c/95165/
[2]: https://review.openstack.org/#/c/102398/

** Affects: neutron
     Importance: Undecided
         Status: New
