yahoo-eng-team team mailing list archive
Message #62181
[Bug 1659760] Re: General scale issue on neutron-fwaas due to RPC broadcast usage (fanout)
Reviewed: https://review.openstack.org/444081
Committed: https://git.openstack.org/cgit/openstack/networking-midonet/commit/?id=bc33639d02a6d6a64aec63d2c15eafbb54247d61
Submitter: Jenkins
Branch: master
commit bc33639d02a6d6a64aec63d2c15eafbb54247d61
Author: YAMAMOTO Takashi <yamamoto@xxxxxxxxxxxx>
Date: Fri Mar 10 12:14:22 2017 +0900
fwaas: Add "host" argument for agent_rpc methods
This is a preparation for the neutron-fwaas change. [1]
[1] I68cbf7403a17ddba49cc5943fb110c1d772e8834
Closes-Bug: #1659760
Change-Id: I710c7dc0f07781e5ed8deb0b91ad4889c865ce59
** Changed in: networking-midonet
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1659760
Title:
General scale issue on neutron-fwaas due to RPC broadcast usage
(fanout)
Status in networking-midonet:
Fix Released
Status in neutron:
Fix Released
Bug description:
Currently, every CRUD method on FWaaS resources (Firewall, FirewallPolicy, FirewallRule, FirewallGroup, ...) sends an AMQP fanout cast to all L3 agents.
This is a design flaw: the AMQP cast should be sent only to the L3 agents managing routers with firewalls related to the tenant.
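To illustrate the difference, here is a minimal, self-contained sketch in plain Python. It does not use oslo.messaging or any neutron-fwaas code; the Agent class and the fanout_cast, targeted_cast and hosts_for_firewall helpers are hypothetical names made up for this illustration. It only models which agents receive a message under each approach.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for an L3 agent; not the real neutron class."""
    host: str
    firewall_router_ids: set = field(default_factory=set)
    inbox: list = field(default_factory=list)


def fanout_cast(agents, method, firewall_id):
    """Deliver the message to every agent, as a fanout cast does."""
    for agent in agents:
        agent.inbox.append((method, firewall_id))
    return len(agents)


def hosts_for_firewall(agents, router_ids):
    """Hosts of the agents managing at least one of the given routers."""
    return {a.host for a in agents if a.firewall_router_ids & set(router_ids)}


def targeted_cast(agents, method, firewall_id, hosts):
    """Deliver the message only to agents running on the given hosts."""
    sent = 0
    for agent in agents:
        if agent.host in hosts:
            agent.inbox.append((method, firewall_id))
            sent += 1
    return sent


if __name__ == "__main__":
    agents = [Agent("agent1", {"router-1"}), Agent("agent2")]
    hosts = hosts_for_firewall(agents, ["router-1"])
    print("fanout:", fanout_cast(agents, "delete_firewall", "fw-1"))
    print("targeted:", targeted_cast(agents, "delete_firewall", "fw-1", hosts))
```

In a real deployment the targeting would presumably go through the messaging layer (oslo.messaging's RPCClient.prepare() accepts both fanout and server arguments), which is what a "host" argument on the agent RPC methods makes possible; the sketch above only shows the intended delivery pattern.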
This design flaw has already resulted in many reported bugs:
1) FirewallNotFound during firewall_deleted
https://bugs.launchpad.net/neutron/+bug/1622460
https://bugs.launchpad.net/neutron/+bug/1658060
Explanation using two L3 agents:
agent1: hosts a router with a firewall for the tenant
agent2: doesn't host any of the tenant's routers
1. neutron firewall-delete <firewall>
2. neutron-server sends an AMQP call "delete_firewall" to agent1 and agent2
3. agent1 cleans up the router firewall and sends "firewall_deleted" back to neutron-server
4. neutron-server deletes the firewall resource from the database
5. agent2 has nothing to clean up and sends "firewall_deleted" back to neutron-server
6. neutron-server raises a "FirewallNotFound" exception
http://paste.openstack.org/raw/94663/
But it doesn't end there :(
7. agent2 receives the "FirewallNotFound" exception
8. on an RPC error, it performs a kind of "full synchronisation" (process_services_sync)
and sends an AMQP call "get_tenants_with_firewalls"
9. neutron-server responds with ALL tenants (even those not related to this agent)
10. For each tenant, agent2 sends an AMQP call:
get_firewalls_for_tenant()
The full sync bug is already reported here:
https://bugs.launchpad.net/neutron/+bug/1618244
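The sequence in steps 1 to 10 can be sketched as a small simulation. This is a toy model in plain Python, not the neutron-fwaas implementation: the Server class and agent_report_deleted function are hypothetical, and only the RPC method names (firewall_deleted, get_tenants_with_firewalls, get_firewalls_for_tenant) and the FirewallNotFound exception name are taken from the description above. It counts how many requests reach the server when a second agent reports a firewall that is already gone.

```python
class FirewallNotFound(Exception):
    """Raised when the firewall is no longer in the (simulated) database."""


class Server:
    """Hypothetical stand-in for neutron-server's firewall RPC callbacks."""

    def __init__(self, firewalls_by_tenant):
        # Maps tenant id -> set of firewall ids (the simulated database).
        self.firewalls_by_tenant = firewalls_by_tenant
        self.rpc_count = 0

    def firewall_deleted(self, firewall_id):
        self.rpc_count += 1
        for firewalls in self.firewalls_by_tenant.values():
            if firewall_id in firewalls:
                firewalls.remove(firewall_id)
                return
        raise FirewallNotFound(firewall_id)

    def get_tenants_with_firewalls(self):
        self.rpc_count += 1
        return [t for t, fws in self.firewalls_by_tenant.items() if fws]

    def get_firewalls_for_tenant(self, tenant):
        self.rpc_count += 1
        return sorted(self.firewalls_by_tenant[tenant])


def agent_report_deleted(server, firewall_id):
    """Model an agent answering delete_firewall, with full sync on error."""
    try:
        server.firewall_deleted(firewall_id)
    except FirewallNotFound:
        # Steps 8-10: fetch ALL tenants, then one request per tenant.
        for tenant in server.get_tenants_with_firewalls():
            server.get_firewalls_for_tenant(tenant)
        return "resynced"
    return "ok"


if __name__ == "__main__":
    server = Server({"t1": {"fw1"}, "t2": {"fw2"}, "t3": {"fw3"}})
    print(agent_report_deleted(server, "fw1"))  # agent1 hosts the router
    print(agent_report_deleted(server, "fw1"))  # agent2 hosts nothing
    print("requests handled by server:", server.rpc_count)
```

In this toy setup a single delete costs the server one request from agent1 plus one request per remaining tenant, and two more, from agent2. The cost of the resync grows with the number of tenants and with the number of agents that did not need the message in the first place.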
2) Intermittent failures of the Tempest check are probably linked:
https://bugs.launchpad.net/neutron/+bug/1649703
3) More generally, on FWaaS CRUD operations, neutron-server floods and is flooded by many AMQP requests.
=> this results in the neutron-server RPC workers being fully busy
=> AMQP messages accumulate in the q-firewall-plugin queue
=> RPC timeouts appear on agents (after 60s)
=> a full synchronisation is triggered
=> etc, etc...
To manage notifications about this bug go to:
https://bugs.launchpad.net/networking-midonet/+bug/1659760/+subscriptions