yahoo-eng-team team mailing list archive

[Bug 1450696] [NEW] With LinuxBridge/VXLAN ARP proxy, ip neigh replace fails due to ARP entry limits

 

Public bug reported:

In an environment with over 600 instances, the LinuxBridge agent (with
l2pop enabled) on the network nodes failed to add neighbor (ARP) entries
when instances were booted. Without an ARP entry, the qrouter namespaces
could not communicate with the instances, because their ARP requests
were not proxied and were dropped. The agent log showed the 'ip neigh
replace' command failing with 'RTNETLINK answers: No buffer space
available'. To resolve this, we increased the gc_thresh sysctl
parameters from their defaults.

To demonstrate, we booted four instances:

infra03_neutron_agents_container-68756ad0:~# nova list
+--------------------------------------+------------------+--------++-------------+------------------------------------+
| ID                                   | Name             | Status || Power State | Networks                           |
+--------------------------------------+------------------+--------++-------------+------------------------------------+
| 0b5678f8-fbaf-475c-908b-fab2300b76e7 | 20150430-JD-RAX1 | ACTIVE || Running     | management-network=10.87.80.39     |
| be2ecc51-cf2b-469d-b768-d262ad2debe9 | 20150430-JD-RAX2 | ACTIVE || Running     | management-network=10.87.80.40     |
| a41b432c-1704-47c4-aa37-22ecde422a73 | 20150430-JD-RAX3 | ACTIVE || Running     | management-network=10.87.80.41     |
| b2c4a80c-06ac-42e6-9ed3-06875a0f1c98 | 20150430-JD-RAX4 | ACTIVE || Running     | management-network=10.87.80.42     |

Three of the four 'ip neigh replace' commands failed on one of the infra
nodes running an l3 agent, which happened to be the node hosting the
router for the respective tenant network:

2015-04-30 19:06:01.835 748 ERROR neutron.agent.linux.utils [req-85a689c4-5056-4b00-a181-06f0a4a51a90 None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'neigh', 'replace', '10.87.80.40', 'lladdr', 'fa:16:3e:42:8d:28', 'dev', 'vxlan-17', 'nud', 'permanent']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
2015-04-30 19:06:08.825 748 INFO neutron.agent.securitygroups_rpc [req-f4034bb3-f15c-4911-b676-bfce60123979 None] Security group member updated [u'dd6ae41a-165b-4f3c-8ffd-ef6e66e64f1e']
2015-04-30 19:06:21.800 748 ERROR neutron.agent.linux.utils [req-4f8da54b-fbe5-469d-ab4a-1ff1eeb20a9c None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'neigh', 'replace', '10.87.80.41', 'lladdr', 'fa:16:3e:31:6b:d6', 'dev', 'vxlan-17', 'nud', 'permanent']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
2015-04-30 19:06:34.585 748 INFO neutron.agent.securitygroups_rpc [req-645ef3f8-e481-4e58-a95b-0a5f9562d4af None] Security group member updated [u'dd6ae41a-165b-4f3c-8ffd-ef6e66e64f1e']
2015-04-30 19:06:44.641 748 ERROR neutron.agent.linux.utils [req-aa4b819e-3351-483b-b45c-db330c5b039f None]
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'neigh', 'replace', '10.87.80.42', 'lladdr', 'fa:16:3e:11:51:60', 'dev', 'vxlan-17', 'nud', 'permanent']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
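
For anyone triaging a similar log, the failed entries can be pulled out
mechanically. The following is only a rough sketch, not part of Neutron:
the function name failed_neigh_replaces and the SAMPLE text are
hypothetical, and it assumes the log keeps the "Command: [...]" layout
shown above, with the Stderr line following its command.

```python
import ast
import re

# Matches the "Command: [...]" line as it appears in the log excerpt above.
COMMAND_RE = re.compile(r"^\s*Command: (\[.*\])\s*$")


def failed_neigh_replaces(log_text):
    """Return (ip, mac, device) for each 'ip neigh replace' command that
    was followed by a 'No buffer space available' error."""
    failures = []
    pending = None  # argv of the most recent 'ip neigh replace' command
    for line in log_text.splitlines():
        match = COMMAND_RE.match(line)
        if match:
            # The logged command is a Python list literal of strings.
            argv = ast.literal_eval(match.group(1))
            pending = argv if "neigh" in argv and "replace" in argv else None
        elif pending and "No buffer space available" in line:
            ip = pending[pending.index("replace") + 1]
            mac = pending[pending.index("lladdr") + 1]
            dev = pending[pending.index("dev") + 1]
            failures.append((ip, mac, dev))
            pending = None
    return failures


# One failing command copied from the excerpt above.
SAMPLE = """\
Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'neigh', 'replace', '10.87.80.40', 'lladdr', 'fa:16:3e:42:8d:28', 'dev', 'vxlan-17', 'nud', 'permanent']
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\\n'
"""

if __name__ == "__main__":
    for ip, mac, dev in failed_neigh_replaces(SAMPLE):
        print(ip, mac, dev)
```

Feeding it the full agent log should list each address whose neighbor
entry was never installed, which can then be compared against the ARP
table on the node.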

The failure was verified by the lack of a permanent ARP entry on the
infra node for the three instances above:

root@infra01_neutron_agents_container-4c850328:~# arp -an | grep 10.87.80.39
? (10.87.80.39) at fa:16:3e:3e:4d:30 [ether] PERM on vxlan-17
root@infra01_neutron_agents_container-4c850328:~# arp -an | grep 10.87.80.40
root@infra01_neutron_agents_container-4c850328:~# arp -an | grep 10.87.80.41
root@infra01_neutron_agents_container-4c850328:~# arp -an | grep 10.87.80.42
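
The same check can be scripted instead of grepping one address at a
time. This is a minimal sketch under the assumption that 'arp -an'
prints entries in the format shown above; the function name
missing_permanent_entries and the sample text are hypothetical.

```python
import re

# Matches lines such as:
#   ? (10.87.80.39) at fa:16:3e:3e:4d:30 [ether] PERM on vxlan-17
ARP_LINE_RE = re.compile(
    r"\((?P<ip>[0-9.]+)\) at (?P<mac>[0-9a-f:]{17}) \[ether\]"
    r"(?P<flags>.*?) on (?P<dev>\S+)"
)


def missing_permanent_entries(arp_output, expected_ips, device):
    """Return the expected IPs that have no PERM entry on the given device."""
    permanent = set()
    for line in arp_output.splitlines():
        match = ARP_LINE_RE.search(line)
        if not match or match.group("dev") != device:
            continue
        if "PERM" in match.group("flags").split():
            permanent.add(match.group("ip"))
    return [ip for ip in expected_ips if ip not in permanent]


# The only entry present in the output above.
ARP_SAMPLE = "? (10.87.80.39) at fa:16:3e:3e:4d:30 [ether] PERM on vxlan-17\n"
EXPECTED = ["10.87.80.39", "10.87.80.40", "10.87.80.41", "10.87.80.42"]

if __name__ == "__main__":
    print(missing_permanent_entries(ARP_SAMPLE, EXPECTED, "vxlan-17"))
```

With the output above it reports 10.87.80.40, 10.87.80.41 and
10.87.80.42 as missing, matching the three failed commands.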

We increased the gc_thresh sysctl parameters from their defaults:

FROM:
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024

TO:
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192

These may not be ideal values, but increasing them allowed subsequent
instances to be booted without issue.
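
To notice the problem before commands start failing, the neighbor table
size can be compared with gc_thresh3, which acts as the hard ceiling.
What follows is a sketch only: the helper names and the 0.8 warning
ratio are arbitrary assumptions, and how the kernel counts entries
(across namespaces, or including permanent ones) varies by kernel
version, so a count taken in one namespace may understate real usage.

```python
from pathlib import Path

# Standard procfs location of the IPv4 neighbor table defaults on Linux.
NEIGH_SYSCTL_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(sysctl_dir=NEIGH_SYSCTL_DIR):
    """Read gc_thresh1..3 from procfs and return them as integers."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int((sysctl_dir / name).read_text()) for name in names}


def count_neigh_entries(neigh_output):
    """Count entries in 'ip -4 neigh show' output (one entry per line)."""
    return sum(1 for line in neigh_output.splitlines() if line.strip())


def headroom_report(entry_count, thresholds, warn_ratio=0.8):
    """Compare an entry count with gc_thresh3; warn_ratio is an assumed
    cut-off for 'close to the limit', not a kernel parameter."""
    hard_limit = thresholds["gc_thresh3"]
    return {
        "entries": entry_count,
        "hard_limit": hard_limit,
        "near_limit": entry_count >= warn_ratio * hard_limit,
    }


# Defaults quoted in this report.
DEFAULTS = {"gc_thresh1": 128, "gc_thresh2": 512, "gc_thresh3": 1024}

if __name__ == "__main__":
    print(headroom_report(900, DEFAULTS))
```

On a live node, the count would come from running 'ip -4 neigh show'
(for example through subprocess.run) and passing its output to
count_neigh_entries, with read_gc_thresholds() supplying the current
limits.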

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1450696

Title:
  With LinuxBridge/VXLAN ARP proxy, ip neigh replace fails due to ARP
  entry limits

Status in OpenStack Neutron (virtual network service):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1450696/+subscriptions

