yahoo-eng-team team mailing list archive

[Bug 1659760] Re: General scale issue on neutron-fwaas due to RPC broadcast usage (fanout)

 

Reviewed:  https://review.openstack.org/426287
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=da425fd913c168d5477588df5d8574fce21e2eb7
Submitter: Jenkins
Branch:    master

commit da425fd913c168d5477588df5d8574fce21e2eb7
Author: Bertrand Lallau <bertrand.lallau@xxxxxxxxxxxxxxx>
Date:   Fri Jan 27 16:52:09 2017 +0100

    Fix RPC scale issue using cast instead of fanout v1
    
    Currently, all CRUD methods used on FWaaS v1 resources (Firewall,
    FirewallPolicy, FirewallRule) result in AMQP fanout cast requests
    sent to all L3 agents (even if they don't have routers or firewalls).
    
    This fix sends AMQP casts only to the L3 agents affected by the
    corresponding firewall.
    
    This problem also affects FWaaS v2 and will be solved in a follow-up
    change.
    
    Change-Id: Id6cb991aee959319997bb15ece240c09d4ac5e39
    Closes-Bug: #1659760


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1659760

Title:
  General scale issue on neutron-fwaas due to RPC broadcast usage
  (fanout)

Status in neutron:
  Fix Released

Bug description:
  Currently, every CRUD method used on FWaaS resources (Firewall, FirewallPolicy, FirewallRule, Firewallgroup, ...) sends an AMQP fanout cast to all L3 agents.
  This design is wrong: the AMQP cast should be sent only to the L3 agents managing routers with firewalls related to the tenant.
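
  To make the difference concrete, here is a minimal sketch that compares a fanout cast with a targeted cast. It is a plain-Python simulation, not the neutron-fwaas code: the Agent class and the functions fanout_cast, targeted_cast and make_agents are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for an L3 agent; `received` records delivered casts."""
    host: str
    firewall_ids: set = field(default_factory=set)  # firewalls on routers it hosts
    received: list = field(default_factory=list)


def fanout_cast(agents, method, firewall_id):
    """Broadcast: every agent gets the message, whether it is concerned or not."""
    for agent in agents:
        agent.received.append((method, firewall_id))
    return len(agents)


def targeted_cast(agents, method, firewall_id):
    """Cast only to the agents hosting a router bound to this firewall."""
    targets = [a for a in agents if firewall_id in a.firewall_ids]
    for agent in targets:
        agent.received.append((method, firewall_id))
    return len(targets)


def make_agents(n):
    """Build n agents; only the first one hosts firewall 'fw-1' (an assumption)."""
    agents = [Agent(host=f"agent{i}") for i in range(1, n + 1)]
    agents[0].firewall_ids.add("fw-1")
    return agents
```

  With 100 agents built by make_agents(100), fanout_cast delivers 100 messages for a single "delete_firewall", while targeted_cast delivers 1.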

  This flawed design has resulted in many bugs that are already reported:

  1) FirewallNotFound during firewall_deleted
  https://bugs.launchpad.net/neutron/+bug/1622460
  https://bugs.launchpad.net/neutron/+bug/1658060

  Explanation using 2 L3 agents:
  agent1: hosts a router with a firewall for the tenant
  agent2: doesn't host any router for the tenant

    1. neutron firewall-delete <firewall>
    2. neutron-server sends an AMQP fanout cast "delete_firewall" to agent1 and agent2
    3. agent1 cleans up the router firewall and sends "firewall_deleted" back to neutron-server
    4. neutron-server deletes the firewall resource from the database
    5. agent2 has nothing to clean up and also sends "firewall_deleted" back to neutron-server
    6. neutron-server raises a "FirewallNotFound" exception
       http://paste.openstack.org/raw/94663/

    But it doesn't end there :(
    7. agent2 gets the "FirewallNotFound" exception back
    8. on an RPC error, it performs a kind of "full synchronisation" (process_services_sync)
       and sends an AMQP call "get_tenants_with_firewalls"
    9. neutron-server responds with ALL tenants (even those not related to this agent)
    10. FOR each tenant, agent2 sends an AMQP call:
       get_firewalls_for_tenant()
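
  The sequence above can be replayed with a small simulation. This is only a sketch under stated assumptions: the Server class and agent_delete_firewall are hypothetical, the method names are borrowed from the steps above, and the real plugin and agent code is more involved.

```python
class FirewallNotFound(Exception):
    pass


class Server:
    """Toy neutron-server: a firewall table plus a counter of RPC calls received."""

    def __init__(self, firewalls):
        self.firewalls = dict(firewalls)  # firewall_id -> tenant_id
        self.rpc_calls = 0

    def firewall_deleted(self, firewall_id):
        self.rpc_calls += 1
        if firewall_id not in self.firewalls:
            raise FirewallNotFound(firewall_id)
        del self.firewalls[firewall_id]

    def get_tenants_with_firewalls(self):
        self.rpc_calls += 1
        return sorted(set(self.firewalls.values()))

    def get_firewalls_for_tenant(self, tenant_id):
        self.rpc_calls += 1
        return [fw for fw, t in self.firewalls.items() if t == tenant_id]


def agent_delete_firewall(server, firewall_id):
    """Agent-side handler: acknowledge the delete; on RPC error, run a full sync."""
    try:
        server.firewall_deleted(firewall_id)
        return "ok"
    except FirewallNotFound:
        # Full sync: one call to list tenants, then one call per tenant.
        for tenant in server.get_tenants_with_firewalls():
            server.get_firewalls_for_tenant(tenant)
        return "full sync"
```

  Starting from 51 firewalls owned by 51 tenants, the acknowledgement from agent1 costs 1 RPC call, while the one from agent2 costs 52 more (the failed firewall_deleted, get_tenants_with_firewalls, and 50 get_firewalls_for_tenant calls), all for an agent that hosts nothing for the tenant.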

  Full sync bug is already reported here:
  https://bugs.launchpad.net/neutron/+bug/1618244

  2) The intermittent Tempest check failures are probably linked:
  https://bugs.launchpad.net/neutron/+bug/1649703

  3) More generally, on FWaaS CRUD operations neutron-server floods the agents with AMQP requests and is flooded by them in return.
  => this results in neutron-server RPC workers being fully busy
  => AMQP messages accumulate in the q-firewall-plugin queue
  => RPC timeouts appear on the agents (after 60s)
  => a full synchronisation is triggered
  => etc, etc...

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1659760/+subscriptions
