yahoo-eng-team team mailing list archive
Message #56510
[Bug 1453264] Re: [SRU] iptables_manager can run very slowly when a large number of security group rules are present
This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu6
---------------
neutron (1:2014.1.5-0ubuntu6) trusty; urgency=medium

  * iptables_manager can run very slowly when a large number of security group
    rules are present (LP: #1453264)
    - d/p/use-dictionary-for-iptables-find.patch: Use a dictionary for looking
      up iptables rules rather than an iterator.

 -- Billy Olsen <billy.olsen@xxxxxxxxxxxxx>  Mon, 29 Aug 2016 15:06:06 -0700
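The changelog entry above describes replacing an iterator-based search with a dictionary lookup. The sketch below is not the actual patch; it only illustrates the general idea, assuming rules can be compared as plain strings. The function names (find_rule_linear, build_rule_index, find_rule_indexed) and the rule strings are hypothetical.

```python
# Illustrative only: the function names and rule strings are hypothetical,
# not taken from neutron's iptables_manager.

def find_rule_linear(rules, wanted):
    """Scan the list until the rule is found: O(n) per lookup."""
    for position, rule in enumerate(rules):
        if rule == wanted:
            return position
    return -1


def build_rule_index(rules):
    """Map each rule string to its first position, built once in O(n)."""
    index = {}
    for position, rule in enumerate(rules):
        index.setdefault(rule, position)
    return index


def find_rule_indexed(index, wanted):
    """Look the rule up in the dictionary: O(1) on average per lookup."""
    return index.get(wanted, -1)


# 10,000 made-up, unique rule strings.
rules = ["-A chain-%d -p tcp --dport %d -j RETURN" % (n // 1000, n)
         for n in range(10000)]
index = build_rule_index(rules)

wanted = rules[-1]
assert find_rule_linear(rules, wanted) == find_rule_indexed(index, wanted)
```

Looking up n rules against a list of n rules costs on the order of n^2 comparisons with the linear scan, but only about n dictionary lookups once the index has been built.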
** Changed in: neutron (Ubuntu Trusty)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1453264
Title:
[SRU] iptables_manager can run very slowly when a large number of
security group rules are present
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive icehouse series:
Fix Committed
Status in neutron:
Fix Released
Status in neutron kilo series:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Trusty:
Fix Released
Bug description:
[Impact]
We have customers that typically add a few hundred security group
rules or more. We also typically run 30+ VMs per compute node. When
about 10+ VMs with a large SG set all get scheduled to the same node,
the L2 agent (OVS) can spend many minutes in the
iptables_manager.apply() code, so much so that by the time all the
rules are updated, the VM has already tried DHCP and failed, leaving
it in an unusable state.
While there have been some patches that tried to address this in Juno
and Kilo, they've either not helped as much as necessary, or broken
SGs completely by re-ordering the iptables rules.
I've been able to show some pretty bad scaling with just a handful of
VMs running in devstack based on today's code (May 8th, 2015) from
upstream OpenStack.
[Test Case]
Here's what I tested:
1. I created a security group with 1000 TCP port rules (you could
alternately have a smaller number of rules and more VMs, but it's
quicker this way)
2. I booted VMs, specifying both the default and "large" SGs, and
timed from the second Neutron "learned" about the port until
it completed its work
3. I got a :( pretty quickly
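Step 1 above could be set up along the following lines. This is not the script attached to the bug; it is a minimal sketch that only prints neutron CLI commands rather than running them. It assumes a security group named "large" already exists and that ports 1-1000 are an acceptable choice for the 1000 TCP rules; the helper name rule_commands is hypothetical.

```python
# Illustrative only: prints the commands instead of running them.
# Assumes a security group named "large" already exists and that
# ports 1-1000 are an acceptable choice for the 1000 TCP rules.

def rule_commands(group, first_port=1, count=1000):
    """Return one neutron CLI command per single-port TCP rule."""
    commands = []
    for port in range(first_port, first_port + count):
        commands.append(
            "neutron security-group-rule-create --direction ingress "
            "--protocol tcp --port-range-min %d --port-range-max %d %s"
            % (port, port, group)
        )
    return commands


commands = rule_commands("large")
for command in commands:
    print(command)
```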
And here's some data:
VMs 1-3 - not timed, less than 20 seconds
4th VM - 0:36
5th VM - 0:53
6th VM - 1:11
7th VM - 1:25
8th VM - 1:48
9th VM - 2:14
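To make the trend in the timings above easier to see, the short sketch below converts them to seconds and computes how much longer each VM took than the previous one. The numbers are copied from the list; the variable and function names are just for illustration.

```python
# Timings copied from the list above (4th through 9th VM), as "m:ss".
timings = ["0:36", "0:53", "1:11", "1:25", "1:48", "2:14"]


def to_seconds(stamp):
    """Convert an "m:ss" string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


seconds = [to_seconds(stamp) for stamp in timings]
# Extra time each VM took compared with the one booted before it.
deltas = [later - earlier for earlier, later in zip(seconds, seconds[1:])]

for vm_number, elapsed in enumerate(seconds, start=4):
    print("VM %d: %d s" % (vm_number, elapsed))
print("increase per additional VM:", deltas)
```

Each VM adds the same number of rules, yet every boot takes longer than the one before it, which suggests the cost of an update grows with the total number of rules already on the node rather than with the number being added.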
While it's busy adding the rules, the OVS agent is consuming pretty
close to 100% of a CPU for most of this time (from top):
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
25767 stack     20   0  157936  76572   4416 R  89.2  0.5  50:14.28 python
And this is with only ~10K rules at this point! Once we cross
the 20K point, VM boot failures start to happen.
I'm filing this bug since we need to take a closer look at this in
Liberty and fix it. It's been this way since Havana and needs some
TLC.
I've attached a simple script I've used to recreate this, and will
start taking a look at options here.
[Regression Potential]
Minimal, since the fix has been running in upstream stable for several
releases now (Kilo, Liberty, Mitaka).
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1453264/+subscriptions