yahoo-eng-team team mailing list archive
Message #94981
[Bug 1838431] Re: [scale issue] ovs-agent port processing time increases linearly and eventually timeouts
Marking invalid based on last comment, please re-open if necessary.
** Changed in: neutron
Status: Confirmed => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1838431
Title:
[scale issue] ovs-agent port processing time increases linearly and
eventually timeouts
Status in neutron:
Invalid
Bug description:
ENV: stable/queens
master has essentially the same code, so the issue may exist there as well.
Config: L2 ovs-agent with OpenFlow-based security groups enabled.
Recently I ran an extreme test locally, booting 2700 instances for a single tenant.
The instances are booted across 2000 networks, but the entire tenant has only one security group with only 5 rules. (This is the key point of the problem.)
The result is totally unacceptable. More than 2000 instances failed to
boot (ERROR), and almost every one of them hit the "vif-plug-timeout"
exception.
How to reproduce:
1. Create 2700 networks one by one with "openstack network create".
2. Create one IPv4 subnet and one IPv6 subnet for every network.
3. Create 2700 routers (a single tenant cannot create more than 255 HA routers because of the VRID range) and connect them to these subnets.
4. Boot the instances:
# Boot 100 batches of 27 instances, pausing 30 seconds between batches.
for i in {1..100}
do
    for j in {1..27}
    do
        nova boot --nic net-name="test-network-xxx" ...
    done
    echo "CLI: boot 27 VMs"
    sleep 30s
done
I have a clue about this issue. The linear increase in processing time looks something like this:
(1) rpc_loop X
5 ports are added to the ovs-agent. They are processed and then put on the updated list because of the local notification.
(2) rpc_loop X + 1
Another 10 ports are added to the ovs-agent, which produces 10 more updated ports through the local notification.
In this loop, the processing time is the update processing for the 5 earlier ports plus the add processing for the 10 new ports.
(3) rpc_loop X + 2
Another 20 ports are added to the ovs-agent.
The processing time is that of 10 updated ports plus 20 added ports.
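The carry-over described in the three loops above can be written down as a small model. This is only a sketch of the reasoning, not ovs-agent code: the function name per_loop_work is hypothetical, and it assumes that every port added in one loop comes back exactly once, as an updated port, in the next loop.

```python
def per_loop_work(added_per_loop):
    """Return (updated, added, total) port counts handled in each rpc_loop.

    Assumption (from the description above, not from the agent source):
    every port added in loop k is reported again as "updated" in loop k + 1
    because of the local notification.
    """
    work = []
    carried_over = 0  # ports added in the previous loop, now "updated"
    for added in added_per_loop:
        work.append((carried_over, added, carried_over + added))
        carried_over = added
    return work


if __name__ == "__main__":
    # The 5 / 10 / 20 sequence is the one used in the bug description.
    for offset, (updated, added, total) in enumerate(per_loop_work([5, 10, 20])):
        print(f"rpc_loop X+{offset}: {updated} updated + {added} added = {total} ports")
```

With the numbers from the description, the model reports 5, 15 and 30 ports of work for loops X, X + 1 and X + 2, so each loop pays for the previous loop's additions a second time.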
Worse, as the number of ports grows, every port under this one security group is related to all the others, so the OpenFlow-based security group processing takes longer and longer.
Eventually some instance ports hit the vif-plug timeout, and those instances fail to boot.
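To illustrate why a single shared security group makes this worse, here is a rough cost model. It is an assumption for illustration, not a measurement and not the agent's real algorithm: it supposes that handling one new port costs time proportional to the number of ports already in the same security group. The names loop_costs and cost_per_related_port are hypothetical.

```python
def loop_costs(batch_size, loops, cost_per_related_port=1):
    """Estimate relative processing cost per rpc_loop for one shared group.

    Assumption: each new port must be related to every current member of
    the security group, so its cost scales with the group size.
    """
    members = 0
    costs = []
    for _ in range(loops):
        members += batch_size
        # Every port in this batch is related to all members seen so far.
        costs.append(batch_size * members * cost_per_related_port)
    return costs


if __name__ == "__main__":
    # 27 ports per batch matches the boot loop in the reproduction steps.
    for loop_index, cost in enumerate(loop_costs(27, 4)):
        print(f"loop {loop_index}: relative cost {cost}")
```

Under this assumption the cost of each loop grows by a constant amount per batch, which matches the linear increase reported in the title, and a fixed vif-plug timeout is eventually exceeded once enough ports share the group.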
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1838431/+subscriptions