yahoo-eng-team team mailing list archive

[Bug 1807396] [NEW] With many VMs on the same tenant, the L3 ip neigh add is too slow

 

Public bug reported:

In our setup, we run with DVR and a very large number of VMs in the same
tenant/project (currently between 1500 and 2000). In such a setup, the
internal function _set_subnet_arp_info of
neutron/agent/l3/dvr_local_router.py takes far too long. On each compute
node (since we run a Neutron L3 router on each compute), it performs
operations like:

ip neigh add

for every VM in the project. As we have both IPv4 and IPv6, the L3 agent
does this twice, once per address family. In our setup, this results in
about 4000 Python processes that have to be spawned to execute the
"ip neigh add" command. This takes between 20 and 30 minutes, each time
we either:

- Add a first VM from the tenant to the host
- Restart the compute node
- Restart the L3 agent
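
To put those figures in perspective, here is a rough back-of-the-envelope
estimate of the per-command cost they imply. It assumes 2000 VMs and
attributes the whole 20 to 30 minutes to the "ip neigh add" calls; both
are simplifying assumptions, not measurements.

```python
# Rough per-command cost implied by the figures in this report.
# Assumptions: 2000 VMs, one "ip neigh add" per VM per address family,
# and the entire delay attributed to those calls.
vms = 2000
address_families = 2  # IPv4 and IPv6
commands = vms * address_families


def seconds_per_command(total_minutes, command_count):
    """Average wall-clock seconds spent per spawned command."""
    return total_minutes * 60 / command_count


for minutes in (20, 30):
    cost = seconds_per_command(minutes, commands)
    print(f"{minutes} min / {commands} commands = {cost:.2f} s per command")
```

Under these assumptions each spawned command costs on the order of a few
tenths of a second, which suggests that process startup, rather than the
kernel neighbour-table update itself, dominates.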

Besides the "ip neigh add" issue, there is the same kind of problem when
OVS runs:

ovs-ofctl add-flows

about 2000 times as well.

In other words, this doesn't scale, and it needs to be addressed so that
the L3 agent can react in a reasonable manner to operations on the DVRs
when there are many VMs in the same project.
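
One possible direction, shown here only as a minimal sketch and not as
what Neutron does or should do: iproute2 can read many commands from a
single "ip -batch" invocation, so the neighbour entries could be applied
with one process instead of one per VM. The function names, the
"qr-example" device and the sample addresses below are hypothetical, and
"neigh replace" is used instead of "neigh add" so that re-running the
batch does not fail on existing entries.

```python
import subprocess


def build_neigh_batch(entries, device):
    """Return text for `ip -batch`: one `neigh replace` line per entry.

    entries: iterable of (ip_address, mac_address) pairs.
    device:  interface name inside the router namespace (placeholder).
    """
    return "".join(
        f"neigh replace {ip} lladdr {mac} nud permanent dev {device}\n"
        for ip, mac in entries
    )


def apply_neigh_batch(batch_text, namespace):
    """Feed the whole batch to a single `ip` process (requires root)."""
    subprocess.run(
        ["ip", "netns", "exec", namespace, "ip", "-batch", "-"],
        input=batch_text,
        text=True,
        check=True,
    )


# Example values only; real entries would come from the subnet's ports.
sample_entries = [
    ("10.0.0.5", "fa:16:3e:00:00:05"),
    ("2001:db8::5", "fa:16:3e:00:00:05"),
]
batch = build_neigh_batch(sample_entries, "qr-example")
print(batch, end="")
```

With this shape, the number of spawned processes no longer grows with
the number of VMs: building the batch is pure string work, and
apply_neigh_batch() runs once per router namespace.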

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1807396
