[Bug 1368281] Re: Scalability issue using neutron-linuxbridge-agent
** Changed in: neutron
Status: Fix Committed => Fix Released
** Changed in: neutron
Milestone: None => liberty-2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1368281
Title:
Scalability issue using neutron-linuxbridge-agent
Status in neutron:
Fix Released
Bug description:
Hi all,
In a cluster of 20 nodes configured to use neutron-linuxbridge-agent,
as part of a load test that involved launching many instances at once
(e.g. 100, 200 ... 500), we noticed that the more instances we
launched, the more of them ended up in ERROR state; for example, with
500 instances, ~250 ended up in ERROR state.
Analysis:
=========
All the instances that end up in ERROR state fail with a timeout
while waiting for neutron to confirm that the vif is plugged.
At this point I should note that we had already increased every
relevant option we could find in both nova and neutron, e.g.
rpc_workers, rpc_response_timeout, vif_plugging_timeout, SA pool_size
..., all without luck.
Further analysis of the neutron code revealed that the bottleneck is
the neutron-linuxbridge-agent, which doesn't set the port to UP in
time.
Digging deeper, we found that the cause was a storm of RPC calls
triggered by port creation: each port creation sends a fanout of the
security_groups_member_updated message to every neutron-linuxbridge-
agent (around 24 agents in our case), asking each of them to pull the
new security group rules for every port (tap device) it hosts, which
in turn triggers an iptables change (did I mention that we use the
iptables security group driver :)). The result is lock contention
between all the greenlets trying to acquire the iptables driver lock.
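For illustration only (the class and method names below are
hypothetical, not neutron's actual code), the pre-fix behaviour looks
roughly like this: every fanout notification makes the agent rebuild
and re-apply rules for each of its ports while holding a single
driver-wide lock, so the work is fully serialized:

    import threading

    class NaiveFirewall(object):
        """Sketch of the per-port, per-notification apply pattern."""

        def __init__(self):
            # a single lock guards every iptables apply in the agent
            self._apply_lock = threading.Lock()

        def _fetch_rules(self, port):
            return []    # stands in for the RPC call back to the server

        def _apply_iptables(self, rules):
            pass         # stands in for an iptables-restore invocation

        def security_groups_member_updated(self, ports):
            # called once per fanout notification the agent receives
            for port in ports:
                with self._apply_lock:      # greenlets queue up here
                    rules = self._fetch_rules(port)
                    self._apply_iptables(rules)

With hundreds of ports being created at once, the number of lock
acquisitions (and iptables invocations) grows with notifications x
ports, which is what starves the handling that sets ports to UP.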
Thankfully, we found that this problem is not new to neutron; in
fact, it was already fixed for the ovs-agent at
https://github.com/openstack/neutron/commit/3046c4ae22b10f9e4fa83a47bfe089554d4a4681.
We implemented the same idea (i.e. defer applying iptables changes),
which fixed the scalability problem; in numbers, we were able to
launch 500 instances at once (nova boot ... --min-count 500), all
active and reachable within 8 minutes.
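A minimal sketch of the deferred-apply idea (again with hypothetical
names, not the actual patch): rule changes made while a deferral flag
is set are only recorded, and a single apply is performed when the
batch ends:

    import contextlib

    class IptablesManager(object):
        def __init__(self):
            self._deferred = False
            self._dirty = False

        @contextlib.contextmanager
        def defer_apply(self):
            # batch every rule change made inside the block
            self._deferred = True
            try:
                yield
            finally:
                self._deferred = False
                if self._dirty:
                    self._flush()    # one iptables invocation per batch
                    self._dirty = False

        def apply(self):
            if self._deferred:
                self._dirty = True   # remember the change, apply later
            else:
                self._flush()

        def _flush(self):
            pass   # stands in for invoking iptables-restore

    # usage: refresh a whole batch of ports under a single apply
    manager = IptablesManager()
    with manager.defer_apply():
        for port in ["tap1", "tap2", "tap3"]:
            # ... update the rules for this port ...
            manager.apply()   # no-op while deferred; flushed on exit

This collapses a burst of security_groups_member_updated notifications
into one iptables run per batch instead of one per port.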
A patch is coming soon.
HTH,
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1368281/+subscriptions