yahoo-eng-team team mailing list archive

Thread
Date
[Bug 1062314] Re: do_refresh_security_group_rules in nova.virt.firewall is very slow

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Sean Dague <sean@xxxxxxxxx>
Date: Fri, 19 Sep 2014 10:56:14 -0000
Reply-to: Bug 1062314 <1062314@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
** No longer affects: nova/essex

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1062314

Title:
  do_refresh_security_group_rules in nova.virt.firewall is very slow

Status in OpenStack Compute (Nova):
  Fix Released
Status in OpenStack Compute (nova) folsom series:
  Fix Released
Status in “nova” package in Ubuntu:
  Fix Released
Status in “nova” source package in Quantal:
  Fix Released

Bug description:
  This is a bug against stable essex. I have made no attempt to
  determine if this is still a problem in Folsom at this stage.

  During a sprint this week we took a nova region which was previously
  relatively idle and started turning up large numbers of instances
  using juju. We started to experience very slow instance starts, which
  I dug into. I should note that juju seems to trigger this behaviour by
  refreshing security groups when ports are exposed, but other openstack
  users will probably experience problems if they are trying to do non-
  trivial things with security groups.

  It appears that do_refresh_security_group_rules can sometimes take a
  very long time to run, and it holds the "iptables" lock while doing
  this. This is a problem because launching a new instance needs to take
  the iptables lock, and can end up being blocked. An example slow
  instance start (nova logs editted for clarity):

  (logs from scheduler node)
  2012-10-05 08:06:28 run_instance
  2012-10-05 08:06:29 cast to <<compute node>>

  (logs from compute node)
  2012-10-05 08:07:21 Starting instance...
  2012-10-05 08:07:34 Starting toXML method
  2012-10-05 08:07:43 Finished toXML method
  2012-10-05 08:07:43 Called setup_basic_filtering in nwfilter
  2012-10-05 08:07:43 Ensuring static filters

  2012-10-05 08:08:48 Attempting to grab semaphore "iptables" for method "_do_refresh_provider_fw_rules"...
  2012-10-05 08:24:00 Got semaphore "iptables" for method "_do_refresh_provider_fw_rules"...

  2012-10-05 08:24:01 Creating image
  2012-10-05 08:24:06 Instance is running
  2012-10-05 08:25:28 Checking state
  2012-10-05 08:25:30 Instance spawned successfully.

  I instrumented utils.synchronized to include lock wait and help times
  like this (patch against essex):

  diff --git a/nova/utils.py b/nova/utils.py
  index 6535b06..2e01a15 100644
  --- a/nova/utils.py
  +++ b/nova/utils.py
  @@ -926,10 +926,16 @@ def synchronized(name, external=False):
               LOG.debug(_('Attempting to grab semaphore "%(lock)s" for method '
                           '"%(method)s"...') % {'lock': name,
                                                 'method': f.__name__})
  +            started_waiting = time.time()
  +
               with sem:
                   LOG.debug(_('Got semaphore "%(lock)s" for method '
  -                            '"%(method)s"...') % {'lock': name,
  -                                                  'method': f.__name__})
  +                            '"%(method)s" after %(wait)f second wait...'),
  +                          {'lock': name,
  +                           'method': f.__name__,
  +                           'wait': time.time() - started_waiting})
  +                started_working = time.time()
  +
                   if external and not FLAGS.disable_process_locking:
                       LOG.debug(_('Attempting to grab file lock "%(lock)s" for '
                                   'method "%(method)s"...') %
  @@ -945,6 +951,12 @@ def synchronized(name, external=False):
                   else:
                       retval = f(*args, **kwargs)
   
  +            LOG.debug(_('Released semaphore "%(lock)s" for method '
  +                        '"%(method)s" after %(wait)f seconds of use...'),
  +                      {'lock': name,
  +                       'method': f.__name__,
  +                       'wait': time.time() - started_working})
  +
               # If no-one else is waiting for it, delete it.
               # See note about possible raciness above.
               if not sem.balance < 1:

  Taking a look at the five longest lock holds in my logs after this
  patch is applied, I get:

  # grep "Released semaphore" /var/log/nova/nova-compute.log | grep iptables | awk '{print$15, $13}' | sort -n | tail -5
  192.134270 "do_refresh_security_group_rules"
  194.140478 "do_refresh_security_group_rules"
  194.153729 "do_refresh_security_group_rules"
  201.135854 "do_refresh_security_group_rules"
  297.725837 "do_refresh_security_group_rules"

  So I then instrumented do_refresh_security_group_rules to try and see
  what was slow. I used this patch (which I know is horrible):

  diff --git a/nova/virt/firewall.py b/nova/virt/firewall.py
  index f0f1594..99f580a 100644
  --- a/nova/virt/firewall.py
  +++ b/nova/virt/firewall.py
  @@ -17,6 +17,8 @@
   #    License for the specific language governing permissions and limitations
   #    under the License.
   
  +import time
  +
   from nova import context
   from nova import db
   from nova import flags
  @@ -167,16 +169,35 @@ class IptablesFirewallDriver(FirewallDriver):
                   self.iptables.ipv6['filter'].add_rule(chain_name, rule)
   
       def add_filters_for_instance(self, instance):
  +        start_time = time.time()
           network_info = self.network_infos[instance['id']]
  +        LOG.debug(_('Get network info took %f seconds'),
  +                  time.time() - start_time)
  +
  +        start_time = time.time()
           chain_name = self._instance_chain_name(instance)
  +        LOG.debug(_('Get chain name took %f seconds'),
  +                  time.time() - start_time)
  +
  +        start_time = time.time()
           if FLAGS.use_ipv6:
               self.iptables.ipv6['filter'].add_chain(chain_name)
           self.iptables.ipv4['filter'].add_chain(chain_name)
  +        LOG.debug(_('Add chain took %f seconds'),
  +                  time.time() - start_time)
  +
  +        start_time = time.time()
           ipv4_rules, ipv6_rules = self._filters_for_instance(chain_name,
                                                               network_info)
  +        LOG.debug(_('Filters for instance took %f seconds'),
  +                  time.time() - start_time)
  +
  +        start_time = time.time()
           self._add_filters('local', ipv4_rules, ipv6_rules)
           ipv4_rules, ipv6_rules = self.instance_rules(instance, network_info)
           self._add_filters(chain_name, ipv4_rules, ipv6_rules)
  +        LOG.debug(_('Add filters took %f seconds'),
  +                  time.time() - start_time)
   
       def remove_filters_for_instance(self, instance):
           chain_name = self._instance_chain_name(instance)
  @@ -362,9 +383,17 @@ class IptablesFirewallDriver(FirewallDriver):
   
       @utils.synchronized('iptables', external=True)
       def do_refresh_security_group_rules(self, security_group):
  +        # TODO(mikal): why is security group passed in if its not used?
           for instance in self.instances.values():
  +            remove_start = time.time()
               self.remove_filters_for_instance(instance)
  +            add_start = time.time()
               self.add_filters_for_instance(instance)
  +            LOG.debug(_('Refreshing security groups for instance (removal '
  +                        'took %(remove)f seconds, add took %(add)f seconds)'),
  +                      {'remove': add_start - remove_start,
  +                       'add': time.time() - add_start},
  +                      instance=instance)
   
       def refresh_provider_fw_rules(self):
           """See :class:`FirewallDriver` docs."""

  This gives me log entries like this (not all are this slow):

  2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] Add filters took 46.641673 seconds from (pid=8052) add_filters_for_instance /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:200
  2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] [instance: 06ea0120-ed5e-4379-90b1-322e1de68950] Refreshing security groups for instance (removal took 0.000958 seconds, add took 46.643727 seconds) from (pid=8052) do_refresh_security_group_rules /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:396

  I'm still looking at this, but it looks like nova.virt.firewall.py
  IptablesFirewallDriver.instance_rules() is the problem. This method
  makes both database and rpc calls while holding the iptables lock.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1062314/+subscriptions