yahoo-eng-team team mailing list archive
Message #62097
[Bug 1659760] Re: General scale issue on neutron-fwaas due to RPC broadcast usage (fanout)
Reviewed: https://review.openstack.org/426287
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=da425fd913c168d5477588df5d8574fce21e2eb7
Submitter: Jenkins
Branch: master
commit da425fd913c168d5477588df5d8574fce21e2eb7
Author: Bertrand Lallau <bertrand.lallau@xxxxxxxxxxxxxxx>
Date: Fri Jan 27 16:52:09 2017 +0100
Fix RPC scale issue using cast instead of fanout v1
Currently, all CRUD methods used on FWaaS v1 resources (Firewall,
FirewallPolicy, FirewallRule) result in AMQP fanout cast requests
sent to all L3 agents (even if they don't have routers or firewalls).
This fix sends the AMQP cast only to the L3 agents affected by the
corresponding firewall.
The same problem also affects FWaaS v2 and will be solved in a follow-up
change.
Change-Id: Id6cb991aee959319997bb15ece240c09d4ac5e39
Closes-Bug: #1659760
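The difference between the two delivery modes described in the commit message can be illustrated with a small in-memory model. This is only a sketch, not the neutron-fwaas code: the Agent class, the fanout_cast, targeted_cast and hosts_for_firewall helpers, and the router-to-host mapping are all hypothetical names introduced for illustration.

```python
# Hypothetical in-memory model of fanout vs. targeted cast.
# None of these names come from neutron-fwaas or oslo.messaging.
from dataclasses import dataclass, field


@dataclass
class Agent:
    host: str
    inbox: list = field(default_factory=list)


def fanout_cast(agents, method, **kwargs):
    """Deliver the message to every agent, whether it is concerned or not."""
    for agent in agents.values():
        agent.inbox.append((method, kwargs))
    return len(agents)


def targeted_cast(agents, hosts, method, **kwargs):
    """Deliver the message only to the agents running on the given hosts."""
    targets = sorted(set(hosts))
    for host in targets:
        agents[host].inbox.append((method, kwargs))
    return len(targets)


def hosts_for_firewall(firewall, router_to_host):
    """Hosts of the agents hosting a router attached to the firewall."""
    return {router_to_host[r] for r in firewall["router_ids"]
            if r in router_to_host}


agents = {h: Agent(h) for h in ("agent1", "agent2", "agent3")}
router_to_host = {"router-a": "agent1"}          # assumed scheduling
firewall = {"id": "fw-1", "router_ids": ["router-a"]}

sent_fanout = fanout_cast(agents, "update_firewall", firewall_id="fw-1")

for agent in agents.values():                    # reset before the second run
    agent.inbox.clear()

sent_targeted = targeted_cast(
    agents, hosts_for_firewall(firewall, router_to_host),
    "update_firewall", firewall_id=firewall["id"])
```

In this model the fanout reaches all three agents, while the targeted cast reaches only agent1, the one agent hosting a router attached to the firewall.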
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1659760
Title:
General scale issue on neutron-fwaas due to RPC broadcast usage
(fanout)
Status in neutron:
Fix Released
Bug description:
Currently, every CRUD method used on FWaaS resources (Firewall, FirewallPolicy, FirewallRule, Firewallgroup, ...) sends an AMQP fanout cast to all L3 agents.
This is a design flaw: the AMQP cast should be sent only to the L3 agents managing routers with firewalls related to the tenant.
This design flaw has resulted in many bugs that have already been reported:
1) FirewallNotFound during firewall_deleted
https://bugs.launchpad.net/neutron/+bug/1622460
https://bugs.launchpad.net/neutron/+bug/1658060
Explanation using 2 L3 agents:
agent1: hosts a router with a firewall for the tenant
agent2: doesn't host any router of the tenant
1. neutron firewall-delete <firewall>
2. neutron-server sends an AMQP call "delete_firewall" to agent1 and agent2
3. agent1 cleans up the router firewall and sends back "firewall_deleted" to neutron-server
4. neutron-server deletes the firewall resource from the database
5. agent2 has nothing to clean up and sends back "firewall_deleted" to neutron-server
6. neutron-server gets a "FirewallNotFound" exception
http://paste.openstack.org/raw/94663/
But it doesn't end there :(
7. agent2 gets back the "FirewallNotFound" exception
8. on an RPC error it performs a kind of "full synchronisation" (process_services_sync)
and sends an AMQP call "get_tenants_with_firewalls"
9. neutron-server responds with ALL tenants (even those not related to this agent)
10. FOR each tenant, agent2 sends an AMQP call:
get_firewalls_for_tenant()
The full sync bug is already reported here:
https://bugs.launchpad.net/neutron/+bug/1618244
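Steps 1 to 6 above can be reproduced with a toy model. This is a minimal sketch under stated assumptions: the Server class and the run_delete helper are hypothetical, and the real neutron-server code differs; only the method name firewall_deleted and the exception name FirewallNotFound come from the description.

```python
# Toy reproduction of the double "firewall_deleted" report.
# Server and run_delete are hypothetical; they are not neutron code.
class FirewallNotFound(Exception):
    pass


class Server:
    def __init__(self):
        # One firewall waiting to be deleted.
        self.firewalls = {"fw-1": {"status": "PENDING_DELETE"}}

    def firewall_deleted(self, firewall_id):
        """Called by an agent once it has finished its cleanup."""
        if firewall_id not in self.firewalls:
            raise FirewallNotFound(firewall_id)
        del self.firewalls[firewall_id]


def run_delete(server, notified_hosts, firewall_id):
    """Every notified agent reports back; return the hosts that got an error."""
    failed = []
    for host in notified_hosts:
        try:
            server.firewall_deleted(firewall_id)
        except FirewallNotFound:
            failed.append(host)
    return failed


# Fanout: both agents are notified, so both report back.
fanout_failures = run_delete(Server(), ["agent1", "agent2"], "fw-1")

# Targeted: only the agent hosting the tenant router is notified.
targeted_failures = run_delete(Server(), ["agent1"], "fw-1")
```

With the fanout, agent2 reports a deletion for a firewall that agent1's report has already removed, and receives the error; with a single notified agent no error occurs.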
2) The intermittent failure of the Tempest check is probably linked:
https://bugs.launchpad.net/neutron/+bug/1649703
3) More generally, on FWaaS CRUD operations neutron-server both sends and receives a flood of AMQP requests.
=> this results in the neutron-server RPC workers being fully busy
=> AMQP messages accumulate in the q-firewall-plugin queue
=> RPC timeouts appear on agents (after 60s)
=> a full synchronisation is triggered
=> etc, etc...
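A rough count shows why this loop is expensive. From steps 8 to 10, one full synchronisation costs one "get_tenants_with_firewalls" call plus one "get_firewalls_for_tenant" call per tenant returned. The sketch below only does that arithmetic; the function names and the figures of 50 agents and 200 tenants are made up for illustration and are not measurements.

```python
# Back-of-the-envelope count of RPC calls caused by full synchronisations.
# Function names and example figures are illustrative assumptions.
def full_sync_calls(num_tenants):
    """One get_tenants_with_firewalls call, then one call per tenant."""
    return 1 + num_tenants


def sync_storm_calls(num_agents, num_tenants):
    """Calls received by neutron-server if every agent runs a full sync."""
    return num_agents * full_sync_calls(num_tenants)


# Hypothetical deployment: 50 L3 agents, 200 tenants returned by the server.
calls = sync_storm_calls(num_agents=50, num_tenants=200)
print(calls)  # 50 * (1 + 200) = 10050
```

Each of these calls occupies a neutron-server RPC worker, which in turn makes further timeouts, and further full synchronisations, more likely.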
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1659760/+subscriptions