yahoo-eng-team team mailing list archive

[Bug 1715734] [NEW] Gratuitous ARP for floating IPs not so gratuitous

 

Public bug reported:

OpenStack Release: Newton
OS: Ubuntu 16.04 LTS

When working in an environment with multiple application deployments
that build up and tear down routers and floating IPs, we have observed
that connectivity to new instances using recycled floating IPs may be
impacted.

In this environment, the external provider network is connected to a
Cisco Nexus 7010 with a default ARP cache timeout of 1500 seconds. We
have observed that the L3 agent sends out the following arpings
when floating IPs are assigned:

2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-A', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.36'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.644 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.29'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.913 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.44'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89

Here is the corresponding packet capture:

18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 tell 172.29.77.39, length 28

The source address in all of those ARP requests is 172.29.77.39, the
primary IP address on the qg interface. As a result, the ARP entry for
the recycled floating IPs on the Nexus is not refreshed and remains
stale. For the gratuitous ARP to be successful, the source IP needs to
be changed to the respective floating IP, so that the source and
destination IPs are the same. The following code change was made in
ip_lib.py:

FROM:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
              # Pass -w to set timeout to ensure exit if interface
              # removed while running
              '-w', 1.5, address]

TO:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
              # Pass -w to set timeout to ensure exit if interface
              # removed while running
              '-w', 1.5, '-S', address, address]
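
For readers who want to try the change outside of Neutron, the snippet below is a minimal standalone sketch of the command construction. The function name build_arping_cmd and its keyword arguments are hypothetical and are not part of ip_lib.py; the '-S' flag is taken as-is from the change above, and other arping implementations may spell the source-address option differently, so check the arping installed on the host.

```python
def build_arping_cmd(iface_name, address, arg="-U", count=1, deadline=1.5,
                     set_source=True):
    """Return an arping argv list (hypothetical helper, not Neutron code)."""
    cmd = ["arping", arg, "-I", iface_name, "-c", str(count),
           # -w sets a timeout so arping exits if the interface is
           # removed while it is running
           "-w", str(deadline)]
    if set_source:
        # '-S' is the flag used in the reported change; this is an
        # assumption about the installed arping, not a universal flag
        cmd += ["-S", address]
    cmd.append(address)
    return cmd


if __name__ == "__main__":
    # Mirrors the first logged command, plus the explicit source address
    print(build_arping_cmd("qg-6582bfec-7d", "172.29.77.36", arg="-A"))
```

With set_source=False the result matches the argument list seen in the L3 agent log above.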

With that change in place, the following packet capture reflects the
new behavior:

18:10:30.389966 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 tell 172.29.77.36, length 28
18:10:30.390068 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 tell 172.29.77.29, length 28
18:10:30.390143 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 tell 172.29.77.44, length 28
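
To make the "who-has X tell Y" fields concrete, here is a rough sketch of the 28-byte ARP request body that tcpdump reports as "length 28". The function name arp_request_payload is hypothetical, and the all-zero target hardware address is only a placeholder assumption; real arping implementations may fill that field differently.

```python
import socket
import struct


def arp_request_payload(sender_mac, sender_ip, target_ip):
    """Pack an Ethernet/IPv4 ARP request body (illustrative only)."""
    mac = bytes.fromhex(sender_mac.replace(":", ""))
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode: request
        mac,                          # sender hardware address
        socket.inet_aton(sender_ip),  # sender IP ("tell")
        b"\x00" * 6,                  # target hardware address (placeholder)
        socket.inet_aton(target_ip),  # target IP ("who-has")
    )


if __name__ == "__main__":
    payload = arp_request_payload("fa:16:3e:99:af:5c",
                                  "172.29.77.36", "172.29.77.36")
    # In a gratuitous ARP the sender and target IP fields are identical
    print(len(payload), payload[14:18] == payload[24:28])
```

In the first capture the sender IP field would hold 172.29.77.39 instead, so the two fields differ.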

Since making the change, we have not had a failed deployment and all
recycled floating IPs appear to be reachable immediately.
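
The difference between the two captures can also be checked mechanically. The following is a small sketch, assuming tcpdump output in the same text format as shown above; the helper name is_gratuitous is hypothetical.

```python
import re

# Matches the "Request who-has <target> tell <sender>" part of a tcpdump line
ARP_REQUEST = re.compile(
    r"Request who-has (?P<target>\d+\.\d+\.\d+\.\d+) "
    r"tell (?P<sender>\d+\.\d+\.\d+\.\d+)"
)


def is_gratuitous(line):
    """Return True or False for an ARP request line, None for other lines."""
    match = ARP_REQUEST.search(line)
    if match is None:
        return None
    # Gratuitous means the sender announces the address it is asking about
    return match.group("target") == match.group("sender")


if __name__ == "__main__":
    before = ("ethertype ARP, Request who-has 172.29.77.36 "
              "tell 172.29.77.39, length 28")
    after = ("ethertype ARP, Request who-has 172.29.77.36 "
             "tell 172.29.77.36, length 28")
    print(is_gratuitous(before), is_gratuitous(after))
```

Piping a live capture through such a check is one way to confirm which source address the L3 agent's arpings carry on a given host.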

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1715734
