[Bug 1368281] Re: Scalability issue using neutron-linuxbridge-agent
** Changed in: neutron
Status: Fix Committed => Fix Released
** Changed in: neutron
Milestone: None => liberty-2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1368281
Title:
Scalability issue using neutron-linuxbridge-agent
Status in neutron:
Fix Released
Bug description:
Hi all,
In a cluster of 20 nodes configured to use neutron-linuxbridge-agent,
as part of a load test that involved launching many instances at once
(e.g. 100, 200 ... 500), we noticed that the more instances we
launched, the more of them ended up in ERROR state; for example, with
500 instances, ~250 ended up in ERROR state.
Analysis:
=========
All the instances that end up in ERROR state fail with a timeout
while waiting for neutron to confirm that the vif is plugged.
At this point I should note that we had already increased every
relevant option we could find in both nova and neutron, e.g.
rpc_workers, rpc_response_timeout, vif_plugging_timeout, SA pool_size
..., all without luck.
Further analysis of the neutron code revealed that the bottleneck is
the neutron-linuxbridge-agent, which doesn't set the port to UP in
time.
Digging deeper, we found that the cause was a storm of RPC calls
triggered by port creation: each port creation sends a fanout of the
security_groups_member_updated message to every neutron-linuxbridge-
agent (around 24 agents in our case), asking each of them to pull the
new security group rules for every port (tap device) it hosts, which
in turn triggers an iptables change (did I mention that we use the
iptables security group driver :)). The result is lock contention
between all the greenlets trying to acquire the iptables driver lock.
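For illustration only (the class and method names below are
hypothetical, not neutron's actual code), the pre-fix behaviour looks
roughly like this: every fanout notification makes the agent rebuild
and re-apply rules for each of its ports while holding a single
driver-wide lock, so the work is fully serialized:

    import threading

    class NaiveFirewall(object):
        """Sketch of the per-port, per-notification apply pattern."""

        def __init__(self):
            # a single lock guards every iptables apply in the agent
            self._apply_lock = threading.Lock()

        def _fetch_rules(self, port):
            return []    # stands in for the RPC call back to the server

        def _apply_iptables(self, rules):
            pass         # stands in for an iptables-restore invocation

        def security_groups_member_updated(self, ports):
            # called once per fanout notification the agent receives
            for port in ports:
                with self._apply_lock:      # greenlets queue up here
                    rules = self._fetch_rules(port)
                    self._apply_iptables(rules)

With hundreds of ports being created at once, the number of lock
acquisitions (and iptables invocations) grows with notifications x
ports, which is what starves the handling that sets ports to UP.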
Thankfully, we found that this problem is not new to neutron; in
fact, it was already fixed for the ovs-agent at
https://github.com/openstack/neutron/commit/3046c4ae22b10f9e4fa83a47bfe089554d4a4681.
We implemented the same idea (i.e. defer applying iptables changes),
which fixed the scalability problem; in numbers, we were able to
launch 500 instances at once (nova boot ... --min-count 500), all
active and reachable within 8 minutes.
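A minimal sketch of the deferred-apply idea (again with hypothetical
names, not the actual patch): rule changes made while a deferral flag
is set are only recorded, and a single apply is performed when the
batch ends:

    import contextlib

    class IptablesManager(object):
        def __init__(self):
            self._deferred = False
            self._dirty = False

        @contextlib.contextmanager
        def defer_apply(self):
            # batch every rule change made inside the block
            self._deferred = True
            try:
                yield
            finally:
                self._deferred = False
                if self._dirty:
                    self._flush()    # one iptables invocation per batch
                    self._dirty = False

        def apply(self):
            if self._deferred:
                self._dirty = True   # remember the change, apply later
            else:
                self._flush()

        def _flush(self):
            pass   # stands in for invoking iptables-restore

    # usage: refresh a whole batch of ports under a single apply
    manager = IptablesManager()
    with manager.defer_apply():
        for port in ["tap1", "tap2", "tap3"]:
            # ... update the rules for this port ...
            manager.apply()   # no-op while deferred; flushed on exit

This collapses a burst of security_groups_member_updated notifications
into one iptables run per batch instead of one per port.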
A patch is coming soon.
HTH,
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1368281/+subscriptions