yahoo-eng-team team mailing list archive

[Bug 1743658] Re: SLAAC address incorrectly deallocated from HA router port due to race condition

 

Reviewed:  https://review.openstack.org/534456
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Submitter: Zuul
Branch:    master

commit fea188acd173fe09e2e6f98534c4b9cb1523ebc6
Author: Ihar Hrachyshka <ihrachys@xxxxxxxxxx>
Date:   Tue Jan 16 13:59:39 2018 -0800

    l3_ha: only pass host into update_port when updating router port bindings
    
    There is a race condition in update_routers_states that may result in
    some fixed ips incorrectly deallocated from router ports. This may
    happen if update_routers_states fetches ports' state before another
    thread updates the list; then update_routers_states passes port payloads
    with old fixed ips into update_port, which results in ip address
    deallocation. Among other things, l3 agent will detect the change and
    remove the affected subnet prefix from radvd configuration file, since
    it doesn't configure extra_subnets for RA.
    
    There is no need to pass full port payload into update_port just to set
    host. This patch replaces the payload with a dict of one key - host.
    This allows core plugin to handle just this host field change, leaving
    existing allocations (and other port attributes) intact.
    
    Change-Id: Ib2c661d6e2cb8e34676fd83e19b6cf65c232545d
    Closes-Bug: #1743658


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1743658

Title:
  SLAAC address incorrectly deallocated from HA router port due to race
  condition

Status in neutron:
  Fix Released

Bug description:
  This was originally reported in Red Hat Bugzilla:
  https://bugzilla.redhat.com/show_bug.cgi?id=1486324

  The issue is triggered when executing
  tempest.scenario.test_network_v6.TestGettingAddress tests in a loop
  with L3 HA enabled. The failure looks as follows:

  Captured traceback:
  ~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
          return f(*func_args, **func_kwargs)
        File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 258, in test_dualnet_multi_prefix_slaac
          dualnet=True)
        File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_v6.py", line 196, in _prepare_and_test
          (ip, srv['id'], ssh.exec_command("ip address")))
        File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 666, in fail
          raise self.failureException(msg)
      AssertionError: Address 2003::1:f816:3eff:fee0:fbf0 not configured for instance d89f1b14-20ef-47f7-80a4-9d3173446dbc, ip address output is
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast qlen 1000
          link/ether fa:16:3e:12:f5:ea brd ff:ff:ff:ff:ff:ff
          inet 10.100.0.8/28 brd 10.100.0.15 scope global eth0
          inet6 fe80::f816:3eff:fe12:f5ea/64 scope link 
             valid_lft forever preferred_lft forever
      3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
          link/ether fa:16:3e:e0:fb:f0 brd ff:ff:ff:ff:ff:ff
          inet6 2003::f816:3eff:fee0:fbf0/64 scope global dynamic 
             valid_lft 86335sec preferred_lft 14335sec
          inet6 fe80::f816:3eff:fee0:fbf0/64 scope link 
             valid_lft forever preferred_lft forever

  The test case creates a network with two IPv6 SLAAC subnets, then
  starts an instance on the network and checks that the instance OS
  configured an address from each prefix. It fails because no address
  from the second prefix is configured.
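
  As a worked example of what the test expects, the sketch below derives
  modified EUI-64 SLAAC addresses from the eth1 MAC shown in the output
  above. The helper name slaac_address is hypothetical, and the second
  prefix (2003:0:0:1::/64) is inferred from the expected address and the
  server log rather than stated in the report.

```python
import ipaddress


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Build a modified EUI-64 address from a /64 prefix and a MAC address."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 based SLAAC needs a /64 prefix")
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe in the middle to form the 64-bit interface identifier.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return network[int.from_bytes(interface_id, "big")]


mac = "fa:16:3e:e0:fb:f0"  # eth1 MAC from the failure output
first = slaac_address("2003::/64", mac)         # present on eth1
second = slaac_address("2003:0:0:1::/64", mac)  # the address that is missing
print(first, second)
```

  The first result matches the global address that eth1 did configure;
  the second matches the address named in the AssertionError.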

  When we check the l3 agent log, we see that radvd is first correctly
  configured for both prefixes, but then something triggers a second
  radvd reconfiguration, this time without one of the prefixes.

  If we trace the event that triggered the second radvd reconfiguration
  back to the server, we see that the router update happened as a result
  of update_routers_states execution (which is itself remotely triggered
  by the l3 agent).

  We see that the update_routers_states call started before the second
  subnet was added to the router in question, but completed only AFTER
  the subnet was added.

  In the server log, we see both allocation and deallocation events for
  the router gateway address:

  2018-01-16 19:50:31.459 886987 DEBUG neutron.db.db_base_plugin_common [req-13cbe23b-ae7f-472e-9049-601e75e04b6a 269a2421f89742b09cfd722dc28aca5c 97e8b9e11107489aad70e7e6d172ddce - default default] Allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01/6f4b11b3-4a31-4264-8c3b-439c8bd903bd) _store_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:122

  
  2018-01-16 19:50:35.144 883810 DEBUG neutron.db.db_base_plugin_common [req-445e4d27-16e4-463f-bfce-b7603a739b01 - - - - -] Delete allocated IP 2003:0:0:1::1 (d7bd8a16-6bf6-4a40-b508-c27d963b3912/fc6a62f5-d1a1-4272-aa67-a70ee8b1ee01) _delete_ip_allocation /usr/lib/python2.7/site-packages/neutron/db/db_base_plugin_common.py:108

  The allocation event belongs to add_router_interface, while the
  deletion comes from update_routers_states.

  Code inspection suggests that deallocation happens because
  update_routers_states does the following:

  1. fetch all router ports;
  2. then for each port payload, set host, and pass the payload into update_port.

  If add_router_interface runs between those two steps, we risk calling
  update_port with a stale port payload that DOESN'T contain the
  fixed_ip added by the add_router_interface call, so update_port
  deallocates that address.
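
  The interleaving described above can be reproduced with a toy model.
  This is only a sketch: ToyPlugin is a hypothetical in-memory stand-in,
  not Neutron's core plugin, and it assumes that a fixed_ips list in the
  payload replaces the stored list (anything missing is deallocated).
  The first address, 2003::1, is made up for the demonstration.

```python
import copy


class ToyPlugin:
    """Hypothetical in-memory port store, not the real Neutron plugin."""

    def __init__(self):
        self.ports = {}

    def get_port(self, port_id):
        # Return a snapshot, as a database read would.
        return copy.deepcopy(self.ports[port_id])

    def update_port(self, port_id, payload):
        """Apply payload; return the addresses that got deallocated."""
        port = self.ports[port_id]
        deallocated = []
        if "fixed_ips" in payload:
            wanted = payload["fixed_ips"]
            deallocated = [ip for ip in port["fixed_ips"] if ip not in wanted]
        port.update(copy.deepcopy(payload))
        return deallocated


plugin = ToyPlugin()
plugin.ports["p1"] = {"id": "p1", "host": None, "fixed_ips": ["2003::1"]}

# Step 1 of update_routers_states: fetch the port (snapshot is now frozen).
stale = plugin.get_port("p1")

# Concurrent add_router_interface: a second subnet's address is allocated.
fresh = plugin.get_port("p1")
fresh["fixed_ips"].append("2003:0:0:1::1")
plugin.update_port("p1", {"fixed_ips": fresh["fixed_ips"]})

# Step 2 of update_routers_states: set host and pass the whole stale payload.
stale["host"] = "node-1"
lost = plugin.update_port("p1", stale)
print("deallocated:", lost)
```

  Running the steps in this order reports the newly added address as
  deallocated, which mirrors the allocate/delete pair in the server log.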

  I think we should avoid passing the whole port payload into
  update_port and instead pass a dict with a single key, host. This is
  semantically correct, fixes the race condition, and in theory may be
  slightly quicker, since the core plugin won't need to process fields
  that were not changed.
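
  A minimal sketch of the difference between the two payload shapes
  follows. The update_port function here is a hypothetical stand-in that
  only changes keys present in the payload; the key name
  "binding:host_id" is an assumption about the attribute Neutron uses
  for the port binding host and should be checked against the actual
  patch.

```python
HOST_KEY = "binding:host_id"  # assumed attribute name, see lead-in


def update_port(stored, payload):
    """Toy update: only the keys present in payload are overwritten."""
    updated = dict(stored)
    updated.update(payload)
    return updated


def full_payload(stale_port, host):
    # Old behaviour: reuse the previously fetched port dict, stale fields too.
    return dict(stale_port, **{HOST_KEY: host})


def host_only_payload(host):
    # Proposed behaviour: send nothing but the host binding.
    return {HOST_KEY: host}


# Port as fetched before the second subnet was attached.
stale = {"id": "p1", HOST_KEY: None, "fixed_ips": ["2003::1"]}
# Port as currently stored, after add_router_interface ran.
stored = {"id": "p1", HOST_KEY: None,
          "fixed_ips": ["2003::1", "2003:0:0:1::1"]}

racy = update_port(stored, full_payload(stale, "node-1"))
safe = update_port(stored, host_only_payload("node-1"))
print(racy["fixed_ips"], safe["fixed_ips"])
```

  With the full payload the stale fixed_ips list overwrites the stored
  one; with the host-only payload the stored allocations are untouched
  and only the host changes.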

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1743658/+subscriptions

