yahoo-eng-team team mailing list archive

[Bug 1838431] [NEW] [scale issue] ovs-agent port processing time increases linearly and eventually timeouts

 

Public bug reported:

ENV: stable/queens
The code on master is basically the same, so the issue probably exists there as well.

Config: L2 ovs-agent with OpenFlow-based security groups enabled.

I recently ran an extreme test locally, booting 2700 instances for a single tenant.
The instances are booted across 2000 networks, but the entire tenant has only one security group, with only 5 rules. (This is the key to the problem.)

The result is unacceptable. About 2000 or more instances failed to
boot (ERROR), and almost every one of them hit the "vif-plug-timeout"
exception.


How to reproduce:
1. Create 2700 networks one by one with "openstack network create".
2. Create one IPv4 subnet and one IPv6 subnet for every network.
3. Create 2700 routers (a single tenant cannot create more than 255 HA routers because of the VRID range) and connect them to these subnets.
4. Boot the instances:
# 100 batches of 27 instances (2700 in total), one batch every 30 seconds
for i in {1..100}
do
    for j in {1..27}
    do
        nova boot --nic net-name="test-network-xxx" ...
    done
    echo "CLI: boot 27 VMs"
    sleep 30s
done
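
Steps 1 to 3 could be scripted along the following lines. This is only a sketch: the resource names, the address ranges, and the helper functions os and create_topology are placeholders that do not come from the report. It prints the commands by default, and only runs them against a real cloud when RUN=1 is set.

```shell
#!/usr/bin/env bash
# Dry run by default: prints the openstack commands instead of running them.
# Set RUN=1 to execute them. All names and address ranges are placeholders.

os() {
    if [ "${RUN:-0}" = "1" ]; then
        openstack "$@"
    else
        echo "openstack $*"
    fi
}

# create_topology COUNT
# For each index: one network, one IPv4 subnet, one IPv6 subnet, and one
# (non-HA) router connected to both subnets.
create_topology() {
    local count=$1 i net
    for ((i = 1; i <= count; i++)); do
        net="test-network-$i"
        os network create "$net"
        os subnet create --network "$net" \
            --subnet-range "10.$((i / 256)).$((i % 256)).0/24" "$net-v4"
        os subnet create --network "$net" --ip-version 6 \
            --subnet-range "fd00:0:0:$(printf '%x' "$i")::/64" "$net-v6"
        os router create "test-router-$i"
        os router add subnet "test-router-$i" "$net-v4"
        os router add subnet "test-router-$i" "$net-v6"
    done
}

create_topology 2   # use 2700 to match the numbers in the report
```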


I have some clues about this issue. The processing time grows linearly, roughly like this:
(1) rpc_loop X
5 ports are added to the ovs-agent. They are processed and then added to the updated list because of the local notification.
(2) rpc_loop X + 1
Another 10 ports are added to the ovs-agent, and these 10 ports in turn go onto the updated list through the local notification.
The processing time of this loop is the update processing for the 5 earlier ports plus the added-port processing for the 10 new ports.
(3) rpc_loop X + 2
Another 20 ports are added to the ovs-agent.
The processing time is that of 10 updated ports plus 20 added ports.
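
The three loops above can be written down as a small toy model. It is not neutron code and the function name is made up. It only encodes the pattern described here: ports added in one loop come back as updated ports in the next loop, so each loop handles its new arrivals plus the previous loop's arrivals.

```shell
#!/usr/bin/env bash
# Toy model of the rpc_loop backlog described in the report (not neutron code).

# ports_handled_per_loop ADDED...
# Takes the number of ports added in each successive loop and prints one line
# per loop: "<added> <updated> <total handled>".
ports_handled_per_loop() {
    local prev_added=0 added
    for added in "$@"; do
        # ports added in the previous loop are reprocessed as "updated" now
        echo "$added $prev_added $((added + prev_added))"
        prev_added=$added
    done
}

# The sequence from the report: 5, then 10, then 20 ports added.
ports_handled_per_loop 5 10 20
```

With these inputs the third loop handles 30 ports even though only 20 are new, which matches the "10 updated + 20 added" line above.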

Worse, as the number of ports grows, every port in this single security group becomes related to all the others, so the OpenFlow-based security group processing takes longer and longer,
until some instance ports hit the vif-plug timeout and the instances fail to boot.
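
To illustrate why a shared security group can eventually push port processing past the timeout, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that wiring up one new port costs a fixed time for every port already in the group. The 10 ms figure and the function name are invented and were not measured. The 300000 ms value corresponds to nova's default vif_plugging_timeout of 300 seconds, and 27 is the batch size from the reproduction loop.

```shell
#!/usr/bin/env bash
# Toy model only: assumes the time to process one new port grows in proportion
# to the number of ports already in the shared security group.

# first_batch_over_timeout BATCH_SIZE COST_MS TIMEOUT_MS
# Prints the number of the first batch whose processing time exceeds
# TIMEOUT_MS. BATCH_SIZE and COST_MS must be positive integers.
first_batch_over_timeout() {
    local batch_size=$1 cost_ms=$2 timeout_ms=$3
    local batch=0 members=0 batch_ms=0
    (( batch_size > 0 && cost_ms > 0 )) || return 1
    while (( batch_ms <= timeout_ms )); do
        batch=$((batch + 1))
        members=$((members + batch_size))
        # each new port is assumed to touch every current group member
        batch_ms=$((batch_size * members * cost_ms))
    done
    echo "$batch"
}

# 27 ports per batch, 10 ms per (port, member) pair (invented), 300 s timeout.
first_batch_over_timeout 27 10 300000
```

The real per-port cost depends on the firewall driver and the rules involved, so only the shape of the growth is meaningful here, not the batch number it prints.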

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1838431
