[Bug 1372438] [NEW] Race condition in l2pop drops tunnels
Public bug reported:
The issue was originally raised by a Red Hat performance engineer (Joe
Talerico) here: https://bugzilla.redhat.com/show_bug.cgi?id=1136969
(see comment 4 onward).
Joe created a Fedora instance in his OpenStack cloud based on RHEL7-OSP5
(Icehouse), where he installed the Rally client to run benchmarks against
that same cloud. He assigned a floating IP to the instance so he could
reach the API endpoints from inside the Rally machine. He then ran a
scenario that started 100+ new instances in parallel, accessed each of
them via ssh, and once that succeeded, cleaned up each created instance
(with its ports). Once in a while, his Rally instance lost connectivity
to the outside world. This was because the VXLAN tunnel to the compute
node hosting the Rally machine was dropped on the networker node where
the DHCP, L3, and Metadata agents were running. Once we restarted the
OVS agent, the tunnel was recreated properly.
The scenario failed only if the l2pop mechanism driver was enabled.
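(For reference, l2pop is enabled by listing the l2population mechanism
driver in ml2_conf.ini and turning on l2_population in the OVS agent
configuration; a typical setup looks like this:)

    [ml2]
    mechanism_drivers = openvswitch,l2population

    # and in the OVS agent configuration on each node:
    [agent]
    l2_population = True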
I've looked through the OVS agent logs and found that the tunnel was
dropped due to a legitimate fdb entry removal request coming from the
neutron-server side. So the fault is probably on the neutron-server
side, in the l2pop mechanism driver.
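For context on what such a removal does: l2pop fans out fdb entries
keyed by network and tunnel endpoint, and an agent tears down the tunnel
port once the flooding entry for its last network on that endpoint is
removed. A simplified sketch of the payload shape, based on the
Icehouse-era l2pop code (illustrative; field names may vary by release):

    # Rough shape of an l2pop "remove_fdb_entries" fanout payload
    # (Icehouse-era structure; illustrative, not copied from the code).
    FLOODING_ENTRY = ['00:00:00:00:00:00', '0.0.0.0']  # bcast/unknown-unicast

    fdb_entries = {
        'net-uuid': {
            'network_type': 'vxlan',
            'segment_id': 1001,
            'ports': {
                # VXLAN endpoint IP of the compute node hosting the ports
                '192.168.0.10': [
                    FLOODING_ENTRY,                     # last one -> drop tunnel
                    ['fa:16:3e:aa:bb:cc', '10.0.0.5'],  # port MAC/IP entry
                ],
            },
        },
    }

When an agent processes a removal containing FLOODING_ENTRY and has no
other networks left on that endpoint, it deletes the VXLAN tunnel port,
which is exactly what cut connectivity to the Rally instance here.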
I then looked through the patches in Juno to see whether something
related to the issue had already merged, and found a patch that gets rid
of the _precommit step when cleaning up fdb entries. Once we applied
that patch on the neutron-server node, we stopped experiencing those
connectivity failures.
After discussion with Vivekanandan Narasimhan, we came up with the
following race condition that may result in tunnels being dropped while
legitimate resources are still using them (quoting Vivek below):
'''
- port1 delete request comes in;
- port1 delete request acquires the lock;
- port2 create/update request comes in;
- port2 create/update waits due to unavailability of the lock;
- precommit phase for port1 determines that the port is the last one, so we should drop the FLOODING_ENTRY;
- port1 delete applied to db;
- port1 transaction releases the lock;
- port2 create/update acquires the lock;
- precommit phase for port2 determines that the port is the first one, so it requests FLOODING_ENTRY + MAC-specific flow creation;
- port2 create/update request applied to db;
- port2 transaction releases the lock.

At this point the postcommit of either request could run first, because
those code pieces operate outside the locked zone.

If it happens this way, the tunnel is retained:
- postcommit phase for port1 requests FLOODING_ENTRY deletion due to port1 deletion;
- postcommit phase for port2 requests FLOODING_ENTRY + MAC-specific flow creation.

If it happens the following way, the tunnel breaks:
- postcommit phase for create port2 requests FLOODING_ENTRY + MAC-specific flow creation;
- postcommit phase for delete port1 requests FLOODING_ENTRY deletion.
'''
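To make the pattern concrete, here is a minimal, self-contained Python
sketch of the decide-under-lock/act-outside-lock structure described
above (not the actual driver code; ports_on_host and tunnel_up stand in
for the DB state and the agent's view of the tunnel):

    import threading

    lock = threading.Lock()
    ports_on_host = {'port1'}   # committed DB state for one network/host pair
    tunnel_up = True            # agent-side view: does the VXLAN tunnel exist?

    def delete_port(port):
        global tunnel_up
        with lock:
            ports_on_host.discard(port)
            last_port = not ports_on_host    # "precommit" decision, under lock
        # lock released; the "postcommit" below can be arbitrarily delayed
        if last_port:
            tunnel_up = False   # FLOODING_ENTRY removal -> agent drops tunnel

    def create_port(port):
        global tunnel_up
        with lock:
            first_port = not ports_on_host   # "precommit" decision, under lock
            ports_on_host.add(port)
        if first_port:
            tunnel_up = True    # FLOODING_ENTRY creation -> agent adds tunnel

    # Breaking interleaving, matching the second ordering above:
    #   1. delete_port('port1') decides last_port=True and releases the lock;
    #   2. create_port('port2') decides first_port=True and its postcommit
    #      runs first (tunnel_up = True);
    #   3. delete_port's stale postcommit runs last (tunnel_up = False),
    #      leaving port2 without a tunnel even though it still exists.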
We considered backporting to Icehouse the patch that gets rid of the
precommit step [1], since it seems to eliminate the race, but we're
concerned because that change was reverted to the previous behaviour in
Juno as part of the DVR work [2]. We haven't done any testing to check
whether the issue is present in Juno, though a brief analysis of the
code suggests it should fail there too.
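Our reading of why patch [1] helps: with the precommit step gone, the
"any ports left on this host?" check is made against committed DB state
at the moment the fdb notification is sent, so a concurrently created
port2 is already visible and the flooding entry survives. A rough
sketch of that shape, with hypothetical helper names:

    def delete_port_postcommit(context):
        # Decide and act after the delete has committed: a port2 created
        # in between is already counted, so FLOODING_ENTRY is not removed
        # out from under it.  count_ports_on_host() and
        # remove_flooding_entry() are hypothetical stand-ins for the
        # driver's DB query and RPC fanout.
        if count_ports_on_host(context) == 0:
            remove_flooding_entry(context)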
Ideally, the fix for Juno should be easily backportable, because the
issue is currently present in Icehouse and we would like the same fix
for both branches (Icehouse and Juno) instead of backporting patch [1]
to Icehouse and implementing a different patch for Juno.
[1]: https://review.openstack.org/#/c/95165/
[2]: https://review.openstack.org/#/c/102398/
** Affects: neutron
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1372438