yahoo-eng-team team mailing list archive
Message #70492
[Bug 1743658] Re: SLAAC address incorrectly deallocated from HA router port due to race condition
Reviewed: https://review.openstack.org/534456
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Submitter: Zuul
Branch: master
commit fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Author: Ihar Hrachyshka <ihrachys@xxxxxxxxxx>
Date: Tue Jan 16 13:59:39 2018 -0800
l3_ha: only pass host into update_port when updating router port bindings
There is a race condition in update_routers_states that may result in
some fixed ips incorrectly deallocated from router ports. This may
happen if update_routers_states fetches ports' state before another
thread updates the list; then update_routers_states passes port payloads
with old fixed ips into update_port, which results in ip address
deallocation. Among other things, l3 agent will detect the change and
remove the affected subnet prefix from radvd configuration file, since
it doesn't configure extra_subnets for RA.
There is no need to pass full port payload into update_port just to set
host. This patch replaces the payload with a dict of one key - host.
This allows core plugin to handle just this host field change, leaving
existing allocations (and other port attributes) intact.
Change-Id: Ib2c661d6e2cb8e34676fd83e19b6cf65c232545d
Closes-Bug: #1743658
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1743658
Title:
SLAAC address incorrectly deallocated from HA router port due to race
condition
Status in neutron:
Fix Released
Bug description:
This was originally reported in Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1486324
The issue is triggered when executing
tempest.scenario.test_network_v6.TestGettingAddress tests in a loop
with L3 HA enabled. The failure looks as follows:
Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
return f(*func_args, **func_kwargs)
File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 258, in test_dualnet_multi_prefix_slaac
dualnet=True)
File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
(ip, srv['id'], ssh.exec_command("ip address")))
File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
raise self.failureException(msg)
AssertionError: Address 2003::1:f816:3eff:fee0:fbf0 not configured for instance d89f1b14-20ef-47f7-80a4-9d3173446dbc, ip address output is
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:12:f5:ea brd ff:ff:ff:ff:ff:ff
inet 10.100.0.8/28 brd 10.100.0.15 scope global eth0
inet6 fe80::f816:3eff:fe12:f5ea/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:e0:fb:f0 brd ff:ff:ff:ff:ff:ff
inet6 2003::f816:3eff:fee0:fbf0/64 scope global dynamic
valid_lft 86335sec preferred_lft 14335sec
inet6 fe80::f816:3eff:fee0:fbf0/64 scope link
valid_lft forever preferred_lft forever
The test case creates a network with two ipv6 slaac subnets, then
starts an instance on the network and checks that the instance OS
configured addresses from both prefixes. It fails because an address
from the second prefix is not configured.
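For reference, the address the test expects can be reproduced from the eth1 MAC address in the output above. The following is only a sketch: it assumes the guest derives its interface identifier with the modified EUI-64 scheme and that both subnets are /64 prefixes (2003::/64 and 2003:0:0:1::/64, inferred from the addresses in this report). The function name slaac_address is made up for illustration and is not part of tempest or neutron.

```python
import ipaddress


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with a modified EUI-64 interface identifier.

    Hypothetical helper for illustration only.
    """
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")

    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    # Flip the universal/local bit of the first octet, then insert ff:fe
    # between the two halves of the MAC to get a 64-bit identifier.
    octets[0] ^= 0x02
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")

    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


# MAC of eth1 from the "ip address" output in the failure message.
mac = "fa:16:3e:e0:fb:f0"
first = slaac_address("2003::/64", mac)         # address that was configured
second = slaac_address("2003:0:0:1::/64", mac)  # address the test was missing
print(first)
print(second)
```

Under these assumptions, the first result matches the global address present on eth1 and the second matches the address named in the AssertionError.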
When we check the l3 agent log, we see that radvd is first correctly
configured for both prefixes, but then something triggers a second
radvd reconfiguration, this time without one of the prefixes.
If we trace the event that triggered the second radvd reconfiguration
back to the server, we see that the router update happened as a result
of update_routers_states execution (which is itself remotely triggered
by the l3 agent).
We see that the update_routers_states call started before the second
subnet was added to the router in question. At the same time, we see
that the call is complete AFTER the subnet is added.
In the server log, we see both allocation and deallocation events for
the router gateway address:
2018-01-16 19:50:31.459 886987 DEBUG neutron.db.db_base_plugin_common [req-13cbe23b-ae7f-472e-9049-601e75e04b6a 269a2421f89742b09cfd722dc28aca5c 97e8b9e11107489aad70e7e6d172ddce - default default] Allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01/6f4b11b3-4a31-4264-8c3b-439c8bd903bd) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122
2018-01-16 19:50:35.144 883810 DEBUG neutron.db.db_base_plugin_common [req-445e4d27-16e4-463f-bfce-b7603a739b01 - - - - -] Delete allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108
The allocation event belongs to add_router_interface, while deletion
is from update_routers_states.
Code inspection suggests that deallocation happens because
update_routers_states does the following:
1. fetch all router ports;
2. then for each port payload, set host, and pass the payload into update_port.
If add_router_interface happened in between those two steps, then we
risk calling update_port with a port payload that DOESN'T contain a
fixed_ip that was added during add_router_interface call.
I think we should avoid passing the whole port payload into
update_port and instead pass a dict with a single key, host. This is
semantically correct, fixes the race condition, and in theory may be
slightly quicker, since the core plugin won't need to process fields
that were not changed.
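The interleaving described above, and the effect of the proposed change, can be sketched with a toy in-memory model. This is not neutron code: ToyPlugin, the port id "p1", the host name "node-1" and the first gateway address 2003::1 are all made up for illustration, and the toy update_port simply overwrites every key present in the payload, which is the behaviour the report attributes to a full-payload update.

```python
import copy


class ToyPlugin:
    """Hypothetical in-memory stand-in for a core plugin (not neutron)."""

    def __init__(self, ports):
        self._ports = copy.deepcopy(ports)

    def get_port(self, port_id):
        # Return a copy, like a payload fetched from the database.
        return copy.deepcopy(self._ports[port_id])

    def update_port(self, port_id, payload):
        # Every key in the payload replaces the stored value;
        # keys that are absent from the payload are left untouched.
        self._ports[port_id].update(copy.deepcopy(payload))


def run(host_only):
    plugin = ToyPlugin({"p1": {"host": None, "fixed_ips": ["2003::1"]}})

    # Step 1: update_routers_states fetches the router port.
    snapshot = plugin.get_port("p1")

    # Meanwhile, add_router_interface allocates an address on the new subnet.
    plugin.update_port(
        "p1", {"fixed_ips": snapshot["fixed_ips"] + ["2003:0:0:1::1"]})

    # Step 2: update_routers_states sets the host and calls update_port.
    if host_only:
        payload = {"host": "node-1"}             # proposed: only the host key
    else:
        payload = dict(snapshot, host="node-1")  # old: full, stale payload
    plugin.update_port("p1", payload)

    return plugin.get_port("p1")


stale_result = run(host_only=False)
fixed_result = run(host_only=True)
print(stale_result["fixed_ips"])  # new address was overwritten
print(fixed_result["fixed_ips"])  # new address is still present
```

In this model the full-payload update drops the address added in between, while the host-only payload leaves fixed_ips alone because the key is never sent.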