[Bug 1062314] Re: do_refresh_security_group_rules in nova.virt.firewall is very slow
** No longer affects: nova/essex
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1062314
Title:
do_refresh_security_group_rules in nova.virt.firewall is very slow
Status in OpenStack Compute (Nova):
Fix Released
Status in OpenStack Compute (nova) folsom series:
Fix Released
Status in “nova” package in Ubuntu:
Fix Released
Status in “nova” source package in Quantal:
Fix Released
Bug description:
This is a bug against stable essex. I have made no attempt to
determine if this is still a problem in Folsom at this stage.
During a sprint this week we took a nova region which was previously
relatively idle and started turning up large numbers of instances
using juju. We started to experience very slow instance starts, which
I dug into. I should note that juju seems to trigger this behaviour by
refreshing security groups when ports are exposed, but other OpenStack
users will probably experience problems if they are trying to do
non-trivial things with security groups.
It appears that do_refresh_security_group_rules can sometimes take a
very long time to run, and it holds the "iptables" lock while doing
this. This is a problem because launching a new instance needs to take
the iptables lock, and can end up being blocked. An example slow
instance start (nova logs edited for clarity):
(logs from scheduler node)
2012-10-05 08:06:28 run_instance
2012-10-05 08:06:29 cast to <<compute node>>
(logs from compute node)
2012-10-05 08:07:21 Starting instance...
2012-10-05 08:07:34 Starting toXML method
2012-10-05 08:07:43 Finished toXML method
2012-10-05 08:07:43 Called setup_basic_filtering in nwfilter
2012-10-05 08:07:43 Ensuring static filters
2012-10-05 08:08:48 Attempting to grab semaphore "iptables" for method "_do_refresh_provider_fw_rules"...
2012-10-05 08:24:00 Got semaphore "iptables" for method "_do_refresh_provider_fw_rules"...
2012-10-05 08:24:01 Creating image
2012-10-05 08:24:06 Instance is running
2012-10-05 08:25:28 Checking state
2012-10-05 08:25:30 Instance spawned successfully.
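(For context on why the spawn stalls for roughly fifteen minutes at the
semaphore: the instance start path and the security group refresh path
are both serialized on the same named "iptables" lock via
utils.synchronized, so a refresh that runs for minutes blocks every
spawn behind it. A minimal standalone sketch of that pattern follows;
it is purely illustrative, not nova code, and the function names
refresh_rules and spawn_instance are made up:)
import threading
import time
_locks = {}                        # name -> lock, like nova's semaphores
_locks_guard = threading.Lock()
def synchronized(name):
    # Simplified stand-in for nova.utils.synchronized.
    def wrap(f):
        def inner(*args, **kwargs):
            with _locks_guard:
                lock = _locks.setdefault(name, threading.Lock())
            started_waiting = time.time()
            with lock:
                print('%s got "%s" after %.1fs wait'
                      % (f.__name__, name, time.time() - started_waiting))
                return f(*args, **kwargs)
        return inner
    return wrap
@synchronized('iptables')
def refresh_rules():               # stands in for do_refresh_security_group_rules
    time.sleep(5)                  # pretend the iptables rebuild is slow
@synchronized('iptables')
def spawn_instance():              # stands in for the instance start path
    pass                           # cannot run until refresh_rules releases the lock
threading.Thread(target=refresh_rules).start()
time.sleep(0.1)                    # let the refresh grab the lock first
spawn_instance()                   # reports a roughly five second wait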
I instrumented utils.synchronized to include lock wait and hold times
like this (patch against essex):
diff --git a/nova/utils.py b/nova/utils.py
index 6535b06..2e01a15 100644
--- a/nova/utils.py
+++ b/nova/utils.py
@@ -926,10 +926,16 @@ def synchronized(name, external=False):
LOG.debug(_('Attempting to grab semaphore "%(lock)s" for method '
'"%(method)s"...') % {'lock': name,
'method': f.__name__})
+ started_waiting = time.time()
+
with sem:
LOG.debug(_('Got semaphore "%(lock)s" for method '
- '"%(method)s"...') % {'lock': name,
- 'method': f.__name__})
+ '"%(method)s" after %(wait)f second wait...'),
+ {'lock': name,
+ 'method': f.__name__,
+ 'wait': time.time() - started_waiting})
+ started_working = time.time()
+
if external and not FLAGS.disable_process_locking:
LOG.debug(_('Attempting to grab file lock "%(lock)s" for '
'method "%(method)s"...') %
@@ -945,6 +951,12 @@ def synchronized(name, external=False):
else:
retval = f(*args, **kwargs)
+ LOG.debug(_('Released semaphore "%(lock)s" for method '
+ '"%(method)s" after %(wait)f seconds of use...'),
+ {'lock': name,
+ 'method': f.__name__,
+ 'wait': time.time() - started_working})
+
# If no-one else is waiting for it, delete it.
# See note about possible raciness above.
if not sem.balance < 1:
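(For reference, the same wait/hold measurement can be written as a
small standalone context manager; this is only an illustration of the
idea behind the patch above, not code that exists in nova, and the
apply_rules() call in the usage comment is hypothetical:)
import contextlib
import logging
import threading
import time
LOG = logging.getLogger(__name__)
@contextlib.contextmanager
def timed_lock(lock, name):
    # Log how long we waited for the lock and how long we held it.
    started_waiting = time.time()
    with lock:
        started_working = time.time()
        LOG.debug('Got "%s" after %f second wait',
                  name, started_working - started_waiting)
        try:
            yield
        finally:
            LOG.debug('Released "%s" after %f seconds of use',
                      name, time.time() - started_working)
# Usage:
#     iptables_lock = threading.Lock()
#     with timed_lock(iptables_lock, 'iptables'):
#         apply_rules()    # whatever work needs the lock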
Taking a look at the five longest lock holds in my logs after this
patch is applied, I get:
# grep "Released semaphore" /var/log/nova/nova-compute.log | grep iptables | awk '{print$15, $13}' | sort -n | tail -5
192.134270 "do_refresh_security_group_rules"
194.140478 "do_refresh_security_group_rules"
194.153729 "do_refresh_security_group_rules"
201.135854 "do_refresh_security_group_rules"
297.725837 "do_refresh_security_group_rules"
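(The same numbers can be pulled out with a few lines of Python instead
of the grep/awk pipeline; this assumes the 'Released semaphore ...
after N seconds of use' wording added by the patch above:)
import re
pattern = re.compile(r'Released semaphore "iptables" for method '
                     r'"(?P<method>[^"]+)" after (?P<secs>[\d.]+) '
                     r'seconds of use')
holds = []
with open('/var/log/nova/nova-compute.log') as log:
    for line in log:
        match = pattern.search(line)
        if match:
            holds.append((float(match.group('secs')),
                          match.group('method')))
for secs, method in sorted(holds)[-5:]:
    print('%f "%s"' % (secs, method))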
So I then instrumented do_refresh_security_group_rules to try and see
what was slow. I used this patch (which I know is horrible):
diff --git a/nova/virt/firewall.py b/nova/virt/firewall.py
index f0f1594..99f580a 100644
--- a/nova/virt/firewall.py
+++ b/nova/virt/firewall.py
@@ -17,6 +17,8 @@
# License for the specific language governing permissions and limitations
# under the License.
+import time
+
from nova import context
from nova import db
from nova import flags
@@ -167,16 +169,35 @@ class IptablesFirewallDriver(FirewallDriver):
self.iptables.ipv6['filter'].add_rule(chain_name, rule)
def add_filters_for_instance(self, instance):
+ start_time = time.time()
network_info = self.network_infos[instance['id']]
+ LOG.debug(_('Get network info took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
chain_name = self._instance_chain_name(instance)
+ LOG.debug(_('Get chain name took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
if FLAGS.use_ipv6:
self.iptables.ipv6['filter'].add_chain(chain_name)
self.iptables.ipv4['filter'].add_chain(chain_name)
+ LOG.debug(_('Add chain took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
ipv4_rules, ipv6_rules = self._filters_for_instance(chain_name,
network_info)
+ LOG.debug(_('Filters for instance took %f seconds'),
+ time.time() - start_time)
+
+ start_time = time.time()
self._add_filters('local', ipv4_rules, ipv6_rules)
ipv4_rules, ipv6_rules = self.instance_rules(instance, network_info)
self._add_filters(chain_name, ipv4_rules, ipv6_rules)
+ LOG.debug(_('Add filters took %f seconds'),
+ time.time() - start_time)
def remove_filters_for_instance(self, instance):
chain_name = self._instance_chain_name(instance)
@@ -362,9 +383,17 @@ class IptablesFirewallDriver(FirewallDriver):
@utils.synchronized('iptables', external=True)
def do_refresh_security_group_rules(self, security_group):
+ # TODO(mikal): why is security group passed in if it's not used?
for instance in self.instances.values():
+ remove_start = time.time()
self.remove_filters_for_instance(instance)
+ add_start = time.time()
self.add_filters_for_instance(instance)
+ LOG.debug(_('Refreshing security groups for instance (removal '
+ 'took %(remove)f seconds, add took %(add)f seconds)'),
+ {'remove': add_start - remove_start,
+ 'add': time.time() - add_start},
+ instance=instance)
def refresh_provider_fw_rules(self):
"""See :class:`FirewallDriver` docs."""
This gives me log entries like this (not all are this slow):
2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] Add filters took 46.641673 seconds from (pid=8052) add_filters_for_instance /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:200
2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] [instance: 06ea0120-ed5e-4379-90b1-322e1de68950] Refreshing security groups for instance (removal took 0.000958 seconds, add took 46.643727 seconds) from (pid=8052) do_refresh_security_group_rules /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:396
I'm still looking at this, but it looks like
nova.virt.firewall.IptablesFirewallDriver.instance_rules() is the
problem. This method makes both database and rpc calls while holding
the iptables lock.
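(One possible shape of a fix, sketched below purely for discussion and
not an actual patch: do the slow instance_rules() work, i.e. the
database and rpc calls, for every instance before taking the lock, and
hold the "iptables" lock only for the rule rewrite itself. The
_apply_instance_filters helper here is hypothetical; the other names
come from the driver code quoted above:)
def do_refresh_security_group_rules(self, security_group):
    # Phase 1: gather rules for every instance without holding the
    # lock, since instance_rules() is where the slow database and rpc
    # calls happen.
    precomputed = {}
    for instance in self.instances.values():
        network_info = self.network_infos[instance['id']]
        precomputed[instance['id']] = self.instance_rules(instance,
                                                          network_info)
    # Phase 2: only the iptables manipulation happens under the lock.
    @utils.synchronized('iptables', external=True)
    def _apply():
        for instance in self.instances.values():
            self.remove_filters_for_instance(instance)
            # Hypothetical helper that installs the precomputed rules
            # instead of calling instance_rules() again.
            self._apply_instance_filters(instance,
                                         precomputed[instance['id']])
    _apply()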
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1062314/+subscriptions