yahoo-eng-team team mailing list archive
Message #70412
[Bug 1743658] [NEW] SLAAC address incorrectly deallocated from HA router port due to race condition
Public bug reported:
This was originally reported in Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1486324
The issue is triggered when executing
tempest.scenario.test_network_v6.TestGettingAddress tests in a loop with
L3 HA enabled. The failure looks as follows:
Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
return f(*func_args, **func_kwargs)
File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 258, in test_dualnet_multi_prefix_slaac
dualnet=True)
File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
(ip, srv['id'], ssh.exec_command("ip address")))
File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
raise self.failureException(msg)
AssertionError: Address 2003::1:f816:3eff:fee0:fbf0 not configured for instance d89f1b14-20ef-47f7-80a4-9d3173446dbc, ip address output is
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:12:f5:ea brd ff:ff:ff:ff:ff:ff
inet 10.100.0.8/28 brd 10.100.0.15 scope global eth0
inet6 fe80::f816:3eff:fe12:f5ea/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:e0:fb:f0 brd ff:ff:ff:ff:ff:ff
inet6 2003::f816:3eff:fee0:fbf0/64 scope global dynamic
valid_lft 86335sec preferred_lft 14335sec
inet6 fe80::f816:3eff:fee0:fbf0/64 scope link
valid_lft forever preferred_lft forever
The test case creates a network with two IPv6 SLAAC subnets, then starts
an instance on the network and checks that the guest OS configured
addresses from both prefixes. It fails because the address from the
second prefix is never configured.
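The missing address is fully determined by the data in the output above: SLAAC combines the advertised /64 prefix with a modified EUI-64 interface identifier derived from the port's MAC address. As a quick illustration (plain Python sketch, not part of tempest or neutron), the expected address can be derived from the eth1 MAC and the second prefix, 2003:0:0:1::/64, whose gateway 2003:0:0:1::1 shows up in the server log further down:

import ipaddress

def slaac_address(prefix, mac):
    # Modified EUI-64 (RFC 4291): flip the universal/local bit, insert ff:fe.
    octets = [int(b, 16) for b in mac.split(':')]
    octets[0] ^= 0x02
    eui64 = octets[:3] + [0xff, 0xfe] + octets[3:]
    iid = ':'.join('%02x%02x' % (eui64[i], eui64[i + 1]) for i in range(0, 8, 2))
    return ipaddress.ip_address(u'%s:%s' % (prefix.rstrip(':'), iid))

print(slaac_address('2003:0:0:1::', 'fa:16:3e:e0:fb:f0'))
# -> 2003::1:f816:3eff:fee0:fbf0

That is exactly the address the assertion expects on eth1, so the failure means the guest never saw a router advertisement for the second prefix.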
When we check the L3 agent log, we see that radvd is first correctly
configured for both prefixes, but then something triggers a second radvd
reconfiguration, this time without one of the prefixes.
If we trace the event that triggered the second radvd reconfiguration
back to the server, we see that the router update happened as a result of
an update_routers_states execution (which is itself remotely triggered by
the L3 agent).
We see that the update_routers_states call started before the second
subnet was added to the router in question, yet it completed AFTER the
subnet was added.
In the server log, we see both allocation and deallocation events for the
router gateway address:
2018-01-16 19:50:31.459 886987 DEBUG neutron.db.db_base_plugin_common [req-13cbe23b-ae7f-472e-9049-601e75e04b6a 269a2421f89742b09cfd722dc28aca5c 97e8b9e11107489aad70e7e6d172ddce - default default] Allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01/6f4b11b3-4a31-4264-8c3b-439c8bd903bd) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122
2018-01-16 19:50:35.144 883810 DEBUG neutron.db.db_base_plugin_common [req-445e4d27-16e4-463f-bfce-b7603a739b01 - - - - -] Delete allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108
The allocation event belongs to add_router_interface, while the deletion
comes from update_routers_states.
Code inspection suggests that the deallocation happens because
update_routers_states does the following:
1. fetch all router ports;
2. then, for each port payload, set the host and pass the whole payload into update_port.
If add_router_interface runs in between those two steps, we risk calling
update_port with a port payload that DOES NOT contain the fixed_ip that
was just added by the add_router_interface call.
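A self-contained toy (plain dicts standing in for the port payload and for the core plugin; the port id, host name and the first fixed IP are purely illustrative, only 2003:0:0:1::1 comes from the log above) shows how the concurrently added fixed_ip gets lost:

# Port payload as read in step 1, BEFORE add_router_interface runs.
stale_port = {
    'id': 'port-1',
    'fixed_ips': [{'ip_address': '2003::1'}],
    'binding:host_id': None,
}

# What the database holds by the time step 2 runs: a concurrent
# add_router_interface has added the gateway IP of the new SLAAC subnet.
db_port = {
    'id': 'port-1',
    'fixed_ips': [{'ip_address': '2003::1'},
                  {'ip_address': '2003:0:0:1::1'}],
    'binding:host_id': None,
}

def update_port(db_port, payload):
    # Stand-in for the core plugin: every field present in the payload wins.
    db_port.update(payload)

# Step 2: set the host on the stale copy and send the WHOLE payload.
stale_port['binding:host_id'] = 'node-1'
update_port(db_port, stale_port)

print(db_port['fixed_ips'])
# [{'ip_address': '2003::1'}] -- 2003:0:0:1::1 is gone, so the server
# deallocates it and radvd is reconfigured without the prefix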
I think we should avoid passing the whole port payload into update_port
and instead pass a dict with a single host key. This is semantically
correct, it fixes the race condition, and in theory it may even be
slightly quicker, since the core plugin won't need to process fields that
were not changed.
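Applied to the toy above, that means replacing the final update_port(db_port, stale_port) call with one that carries nothing but the host key, after which db_port keeps both fixed_ips entries (in neutron terms the payload would presumably be built around the portbindings host_id attribute rather than the plain string used in this sketch):

# Step 2, fixed: send only the attribute that update_routers_states changes.
update_port(db_port, {'binding:host_id': 'node-1'})
print(db_port['fixed_ips'])
# both 2003::1 and 2003:0:0:1::1 are still present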
** Affects: neutron
Importance: High
Assignee: Ihar Hrachyshka (ihar-hrachyshka)
Status: Confirmed
** Tags: l3-ha
** Changed in: neutron
Status: New => Confirmed
** Changed in: neutron
Importance: Undecided => High
** Changed in: neutron
Assignee: (unassigned) => Ihar Hrachyshka (ihar-hrachyshka)
** Tags added: l3-ha
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1743658