[Bug 1368281] [NEW] Scalability issue using neutron-linuxbridge-agent
Public bug reported:
Hi all,
In a 20-node cluster configured to use neutron-linuxbridge-agent, as part
of a load test that launches a lot of instances at once (e.g. 100, 200 ...
500), we noticed that the more instances we ask to launch, the more of the
resulting instances end up in ERROR state; for example, with 500 instances,
~250 end up in ERROR state.
Analysis:
=========
All the instances that end up in ERROR state fail because of a timeout
while waiting for neutron to confirm the vif is plugged.
At this point I should note that we had already increased every option we
could find in both nova and neutron, e.g. rpc_workers,
rpc_response_timeout, vif_plugging_timeout, the database's pool size ...,
all of this without luck.
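For illustration, the kind of settings we bumped look like the snippet
below. The option names are the real ones mentioned above; the sections
and values are only examples, not recommendations:

    # nova.conf (illustrative values only)
    [DEFAULT]
    vif_plugging_timeout = 600
    rpc_response_timeout = 120

    # neutron.conf (illustrative values only)
    [DEFAULT]
    rpc_workers = 8
    rpc_response_timeout = 120

    [database]
    max_pool_size = 40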
Further analysis of the neutron code revealed that the bottleneck is
neutron-linuxbridge-agent, which doesn't set the port to UP in time.
Digging deeper, we found that the cause is the storm of RPC calls that
starts with a port creation: a security_groups_member_updated fanout is
sent to every neutron-linuxbridge-agent (around 24 agents in our case),
asking each of them to pull the new security group rules for every port
(tap device) it hosts, which in turn triggers iptables changes (did I
mention that we use the iptables security group driver :)). The result is
lock contention between all greenlets trying to acquire the iptables
driver lock ...
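To make the contention concrete, here is a minimal standalone sketch (not
actual neutron code) of what happens on each agent: every fanout makes one
greenlet per affected port queue up behind a single iptables lock, so the
refresh work is fully serialized:

    # Minimal sketch (not neutron code): N greenlets, one shared lock,
    # each "iptables refresh" has to wait for all the others.
    import time
    import eventlet
    eventlet.monkey_patch()

    iptables_lock = eventlet.semaphore.Semaphore(1)

    def refresh_port_filter(port_id):
        # Every security_groups_member_updated fanout makes each port's
        # filter refresh fight for this one lock.
        with iptables_lock:
            time.sleep(0.5)  # stand-in for an iptables-save/restore cycle

    start = time.time()
    pool = eventlet.GreenPool()
    for port in range(30):       # e.g. 30 tap devices on one compute node
        pool.spawn_n(refresh_port_filter, port)
    pool.waitall()
    # Total time is roughly 30 * 0.5s because everything is serialized on
    # the lock, and this repeats for every fanout received.
    print("refreshed 30 ports in %.1f seconds" % (time.time() - start))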
Thankfully, this problem is not new to neutron; as a matter of fact, it
was already fixed for the ovs-agent in
https://github.com/openstack/neutron/commit/3046c4ae22b10f9e4fa83a47bfe089554d4a4681.
We implemented the same idea (i.e. defer applying iptables changes), and
this fixed the scalability problem; number-wise, we were able to launch
500 instances at once (nova boot ... --min-count 500), all active and
reachable within 8 minutes.
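The shape of the change mirrors the ovs-agent fix: instead of rewriting
iptables once per port update, buffer the rule changes and flush them in
one shot at the end of each agent loop iteration. A rough sketch of the
idea (simplified, not the actual neutron iptables manager API):

    # Rough sketch of "defer apply" (simplified, not neutron's real API).
    class IptablesDriver(object):
        def __init__(self):
            self.deferred = False
            self.dirty = False

        def add_rule(self, rule):
            self.dirty = True
            if not self.deferred:
                self.apply()      # old behaviour: one iptables call per change

        def defer_apply_on(self):
            self.deferred = True

        def defer_apply_off(self):
            self.deferred = False
            if self.dirty:
                self.apply()      # single iptables call for the whole batch
                self.dirty = False

        def apply(self):
            pass  # iptables-save/iptables-restore would happen here

    # In the agent loop, wrap the per-iteration work:
    #   driver.defer_apply_on()
    #   ... handle all pending port / security group updates ...
    #   driver.defer_apply_off()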
A patch is coming soon.
HTH,
** Affects: neutron
Importance: Undecided
Status: New
** Summary changed:
- Scaling issues using neutron-linuxbridge
+ Scalability issue using neutron-linuxbridge-agent