yahoo-eng-team team mailing list archive

[Bug 1743658] [NEW] SLAAC address incorrectly deallocated from HA router port due to race condition

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Ihar Hrachyshka <1743658@xxxxxxxxxxxxxxxxxx>
Date: Tue, 16 Jan 2018 21:58:55 -0000
Reply-to: Bug 1743658 <1743658@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

This was originally reported in Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1486324

The issue is triggered when executing
tempest.scenario.test_network_v6.TestGettingAddress tests in a loop with
L3 HA enabled. The failure looks as follows:

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
        return f(*func_args, **func_kwargs)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 258, in test_dualnet_multi_prefix_slaac
        dualnet=True)
      File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
        (ip, srv['id'], ssh.exec_command("ip address")))
      File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
        raise self.failureException(msg)
    AssertionError: Address 2003::1:f816:3eff:fee0:fbf0 not configured for instance d89f1b14-20ef-47f7-80a4-9d3173446dbc, ip address output is
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
        link/ether fa:16:3e:12:f5:ea brd ff:ff:ff:ff:ff:ff
        inet 10.100.0.8/28 brd 10.100.0.15 scope global eth0
        inet6 fe80::f816:3eff:fe12:f5ea/64 scope link 
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
        link/ether fa:16:3e:e0:fb:f0 brd ff:ff:ff:ff:ff:ff
        inet6 2003::f816:3eff:fee0:fbf0/64 scope global dynamic 
           valid_lft 86335sec preferred_lft 14335sec
        inet6 fe80::f816:3eff:fee0:fbf0/64 scope link 
           valid_lft forever preferred_lft forever

The test case creates a network with two IPv6 SLAAC subnets, then starts
an instance on the network and checks that the guest OS has configured
addresses from both prefixes. It fails because no address from the
second prefix is configured.
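
For reference, the address the test expects can be reproduced from the
values in the output above. This is only a minimal sketch, assuming the
guest derives its interface identifier from the MAC address using
modified EUI-64 (which matches the eth1 addresses shown). The helper
name slaac_eui64_address is made up for illustration, and the second
prefix 2003:0:0:1::/64 is inferred from the expected address in the
assertion message.

```python
import ipaddress


def slaac_eui64_address(prefix, mac):
    """Build the SLAAC address for a /64 prefix from a MAC (modified EUI-64)."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    octets = [int(part, 16) for part in mac.split(":")]
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe in the middle to expand 48 bits to 64 bits.
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    interface_id = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


mac = "fa:16:3e:e0:fb:f0"  # eth1 MAC from the "ip address" output
# First prefix: the address that eth1 did configure.
print(slaac_eui64_address("2003::/64", mac))        # 2003::f816:3eff:fee0:fbf0
# Second prefix: the address the test reports as missing.
print(slaac_eui64_address("2003:0:0:1::/64", mac))  # 2003::1:f816:3eff:fee0:fbf0
```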

When we check the l3 agent log, we see that radvd is first configured
correctly for both prefixes, but then something reconfigures radvd
again, this time without one of the prefixes.

If we trace the event that triggered the second radvd reconfiguration
back to the server, we see that the router update happened as a result
of update_routers_states execution (which is itself triggered remotely
by the l3 agent).

We see that the update_routers_states call started before the second
subnet was added to the router in question, but completed AFTER the
subnet was added.

In the server log, we see both allocation and deallocation events for
the router gateway address:

2018-01-16 19:50:31.459 886987 DEBUG neutron.db.db_base_plugin_common [req-13cbe23b-ae7f-472e-9049-601e75e04b6a 269a2421f89742b09cfd722dc28aca5c 97e8b9e11107489aad70e7e6d172ddce - default default] Allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01/6f4b11b3-4a31-4264-8c3b-439c8bd903bd) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122


2018-01-16 19:50:35.144 883810 DEBUG neutron.db.db_base_plugin_common [req-445e4d27-16e4-463f-bfce-b7603a739b01 - - - - -] Delete allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108

The allocation event belongs to add_router_interface, while the
deletion comes from update_routers_states.

Code inspection suggests that the deallocation happens because
update_routers_states does the following:

1. fetches all router ports;
2. then, for each port payload, sets the host and passes the payload into update_port.

If add_router_interface runs between those two steps, then we risk
calling update_port with a stale port payload that DOESN'T contain the
fixed_ip that was added during the add_router_interface call, so the
update removes that address from the port.
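
The interleaving can be illustrated with a small in-memory model. This
is only a sketch under stated assumptions: FakeCorePlugin and its
simplified get_port/update_port signatures are hypothetical stand-ins
(the real core plugin API takes more arguments), the port id "p1" and
host "node-1" are made up, and the host is assumed to be carried in
the binding:host_id port attribute.

```python
import copy


class FakeCorePlugin:
    """Hypothetical in-memory stand-in for a core plugin; not neutron code."""

    def __init__(self):
        self.ports = {}

    def get_port(self, port_id):
        # Return a snapshot, as a DB read would.
        return copy.deepcopy(self.ports[port_id])

    def update_port(self, port_id, payload):
        # Every key present in the payload overwrites the stored value,
        # so a stale fixed_ips list replaces the current one.
        self.ports[port_id].update(copy.deepcopy(payload))


plugin = FakeCorePlugin()
plugin.ports["p1"] = {"id": "p1", "binding:host_id": "",
                      "fixed_ips": ["2003::1"]}

# Step 1 of update_routers_states: fetch the port payload.
stale = plugin.get_port("p1")

# Concurrent add_router_interface: a second SLAAC address is allocated.
current = plugin.get_port("p1")
plugin.update_port(
    "p1", {"fixed_ips": current["fixed_ips"] + ["2003:0:0:1::1"]})

# Step 2 of update_routers_states: set host and send the WHOLE payload.
stale["binding:host_id"] = "node-1"
plugin.update_port("p1", stale)

print(plugin.ports["p1"]["fixed_ips"])  # the second address is gone
```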

I think we should avoid passing the whole port payload into update_port
and instead pass a dict with a single key, host. This is semantically
correct, fixes the race condition, and in theory may be slightly
quicker, since the core plugin won't need to process fields that were
not changed.
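
A minimal sketch of that idea, not the actual patch: FakeCorePlugin is
a hypothetical in-memory stand-in with simplified signatures,
set_port_host is an illustrative helper name, and the host is assumed
to be carried in the binding:host_id port attribute.

```python
import copy


class FakeCorePlugin:
    """Hypothetical in-memory stand-in for a core plugin; not neutron code."""

    def __init__(self):
        self.ports = {}

    def get_port(self, port_id):
        return copy.deepcopy(self.ports[port_id])

    def update_port(self, port_id, payload):
        # Only the keys present in the payload are overwritten.
        self.ports[port_id].update(copy.deepcopy(payload))


def set_port_host(plugin, port_id, host):
    # Send only the field that changes, so concurrently modified
    # fields such as fixed_ips are never written back from a snapshot.
    plugin.update_port(port_id, {"binding:host_id": host})


plugin = FakeCorePlugin()
plugin.ports["p1"] = {"id": "p1", "binding:host_id": "",
                      "fixed_ips": ["2003::1"]}

# The port list is fetched, as before.
snapshot = plugin.get_port("p1")

# Concurrent add_router_interface adds the second SLAAC address.
plugin.update_port(
    "p1", {"fixed_ips": snapshot["fixed_ips"] + ["2003:0:0:1::1"]})

# The host update no longer carries the stale fixed_ips list.
set_port_host(plugin, snapshot["id"], "node-1")

print(plugin.ports["p1"]["fixed_ips"])  # both addresses are still present
```

With the same interleaving as in the race, the port keeps both fixed
IPs because the stale snapshot is only used to look up the port id.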

** Affects: neutron
     Importance: High
     Assignee: Ihar Hrachyshka (ihar-hrachyshka)
         Status: Confirmed


** Tags: l3-ha

** Changed in: neutron
       Status: New => Confirmed

** Changed in: neutron
   Importance: Undecided => High

** Changed in: neutron
     Assignee: (unassigned) => Ihar Hrachyshka (ihar-hrachyshka)

** Tags added: l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1743658

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1743658/+subscriptions
Follow ups

[Bug 1743658] Re: SLAAC address incorrectly deallocated from HA router port due to race condition
From: OpenStack Infra, 2018-01-21