[Bug 1368281] [NEW] Scalability issue using neutron-linuxbridge-agent

 

Public bug reported:

Hi all,

In a 20-node cluster configured to use neutron-linuxbridge-agent, we ran
a load test that launched a large number of instances at once (e.g. 100,
200, ... 500). We noticed that the more instances we launched, the more
of them ended up in ERROR state; for example, with 500 instances, ~250
ended up in ERROR state.

Analysis:
=========

All the instances that ended up in ERROR state failed the same way: a
timeout while waiting for neutron to confirm that the VIF was plugged.

At this point I should note that we had already increased every relevant
option we could find in both nova and neutron, e.g. rpc_workers,
rpc_response_timeout, vif_plugging_timeout, the database pool size,
etc., all without luck.
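
For reference, the tuning looked like the following (the option names
are the real nova/neutron options we touched, but the values here are
illustrative, not a recommendation):

    # neutron.conf (illustrative values)
    [DEFAULT]
    rpc_workers = 8
    rpc_response_timeout = 300
    [database]
    max_pool_size = 40

    # nova.conf (illustrative values)
    [DEFAULT]
    vif_plugging_timeout = 600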

Further analysis of the neutron code revealed that the bottleneck is the
neutron-linuxbridge-agent, which doesn't set the port to UP in time.
Digging deeper, we found the cause to be the storm of RPC calls that
each port creation starts: a security_groups_member_updated fanout is
sent to every neutron-linuxbridge-agent (around 24 agents in our case),
asking each of them to pull the new security group rules for every port
(tap device) it manages. Each refresh in turn rewrites iptables rules
(did I mention that we use the iptables security group driver? :)), and
the result is lock contention between all the greenlets trying to
acquire the iptables driver lock.
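
To make the contention concrete, here is a minimal, self-contained
sketch of that pattern (plain threads stand in for greenlets, and a
sleep stands in for an iptables-save/restore; none of this is neutron's
actual code):

    import threading
    import time

    iptables_lock = threading.Lock()  # stands in for the iptables driver lock

    def refresh_port_rules(port_id):
        # every per-port refresh serializes on the same lock
        with iptables_lock:
            time.sleep(0.1)  # stands in for an iptables-save/restore cycle

    def on_security_groups_member_updated(local_ports):
        # one full pass over the local ports per fanout notification;
        # with N ports and M concurrent fanouts that is ~N*M rewrites
        # before the agent gets around to marking new ports UP
        workers = [threading.Thread(target=refresh_port_rules, args=(p,))
                   for p in local_ports]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

    on_security_groups_member_updated(['tap%d' % i for i in range(50)])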

Thankfully, this problem is not new to neutron; as a matter of fact, it
was already fixed for the ovs-agent at
https://github.com/openstack/neutron/commit/3046c4ae22b10f9e4fa83a47bfe089554d4a4681.

We implemented the same idea (i.e. deferring the application of iptables
changes), and it fixed the scalability problem. Number-wise, we were
able to launch 500 instances at once (nova boot ... --min-count 500)
with all of them active and reachable within 8 minutes.

A patch is coming soon.

HTH,

** Affects: neutron
     Importance: Undecided
         Status: New

** Summary changed:

- Scaling issues using neutron-linuxbridge
+ Scalability issue using neutron-linuxbridge-agent

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1368281
