yahoo-eng-team team mailing list archive

[Bug 1453264] [NEW] iptables_manager can run very slowly when a large number of security group rules are present

 

Public bug reported:

We have customers that typically add a few hundred security group rules
or more.  We also typically run 30+ VMs per compute node.  When 10 or
more VMs with a large SG set all get scheduled to the same node, the L2
agent (OVS) can spend many minutes in the iptables_manager.apply() code,
so long that by the time all the rules are updated, the VM has already
tried DHCP and failed, leaving it in an unusable state.

While there have been some patches that tried to address this in Juno
and Kilo, they've either not helped as much as necessary, or broken SGs
completely by re-ordering the iptables rules.

I've been able to show some pretty bad scaling with just a handful of
VMs running in devstack based on today's code (May 8th, 2015) from
upstream OpenStack.

Here's what I tested:

1. I created a security group with 1000 TCP port rules (you could
alternately have a smaller number of rules and more VMs, but it's
quicker this way)

2. I booted VMs, specifying both the default and "large" SGs, and timed
from the moment Neutron "learned" about the port until it completed its
work

3. I got a :( pretty quickly
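
For anyone who wants to reproduce step 1 without the attachment, here is
a minimal sketch that only generates the CLI commands for a group full
of single-port TCP rules.  It is not the attached big-sec-rules.sh.  The
group name "large" comes from the report; the starting port of 1000, the
ingress direction and the helper name rule_commands are assumptions made
for illustration.

```python
def rule_commands(group, first_port=1000, count=1000):
    """Return one neutron CLI command string per single-port TCP rule.

    Nothing is executed here; the caller decides how to run the commands.
    first_port and count are illustrative defaults, not values taken
    from the attached script.
    """
    commands = []
    for port in range(first_port, first_port + count):
        commands.append(
            "neutron security-group-rule-create --direction ingress "
            "--protocol tcp --port-range-min {0} --port-range-max {0} {1}"
            .format(port, group)
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed or piped to a shell.
    for command in rule_commands("large"):
        print(command)
```

Printing the commands instead of running them keeps the sketch free of
side effects, and it makes it easy to shrink the rule count and boot
more VMs instead.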

And here's some data:

1-3 VM - didn't time, less than 20 seconds
4th VM - 0:36
5th VM - 0:53
6th VM - 1:11
7th VM - 1:25
8th VM - 1:48
9th VM - 2:14
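
To make the trend easier to see, the timings above can be converted to
seconds and differenced.  This is just arithmetic on the numbers listed
here; the variable and function names are made up for the example.

```python
# Timings from the list above, keyed by VM number (minutes:seconds).
timings = {4: "0:36", 5: "0:53", 6: "1:11", 7: "1:25", 8: "1:48", 9: "2:14"}


def to_seconds(text):
    """Convert an 'M:SS' string to a whole number of seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


seconds = {vm: to_seconds(t) for vm, t in timings.items()}
vms = sorted(seconds)

# Extra time each additional VM took compared with the previous one.
deltas = [seconds[b] - seconds[a] for a, b in zip(vms, vms[1:])]
average_delta = sum(deltas) / float(len(deltas))

print(deltas)         # per-VM increase in seconds
print(average_delta)  # mean increase per additional VM
```

Each additional VM takes roughly 14 to 26 seconds longer than the one
before it, so the time to wire up a new port keeps growing with the
number of rules already on the node instead of staying flat.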

While it's busy adding the rules, the OVS agent is consuming pretty
close to 100% of a CPU for most of this time (from top):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
25767 stack     20   0  157936  76572   4416 R  89.2  0.5  50:14.28 python

And this is with only ~10K rules at this point!  Once we cross the 20K
mark, VM boot failures start to happen.

I'm filing this bug since we need to take a closer look at this in
Liberty and fix it; it's been this way since Havana and needs some TLC.

I've attached a simple script I've used to recreate this, and will start
taking a look at options here.

** Affects: neutron
     Importance: Undecided
     Assignee: Brian Haley (brian-haley)
         Status: New

** Attachment added: "Script to add 1000 security group rules"
   https://bugs.launchpad.net/bugs/1453264/+attachment/4393817/+files/big-sec-rules.sh

** Changed in: neutron
     Assignee: (unassigned) => Brian Haley (brian-haley)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1453264


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1453264/+subscriptions

