yahoo-eng-team team mailing list archive

[Bug 1692971] [NEW] neutron operators using L3 agent might need to tune SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE

Public bug reported:

Summary
=======
OpenStack operators deploying the L3 agent might need to tune the SYNC_ROUTERS_MIN/MAX_CHUNK_SIZE parameters to avoid flooding the neutron-server.

High level description
======================
The neutron L3 agent and its derivatives (such as neutron-vpn-agent) perform a full sync when they start. The process is to fetch the list of routers scheduled to the agent from the neutron-server and then issue a sync_routers RPC call for the delta between what the agent already has online and what it needs to synchronise.
The call time depends linearly on the number of routers scheduled to the agent and can hit the RPC timeout if the server is overloaded (for example during a complete datacenter outage or a multi-step upgrade). The L3 agent will reduce the chunk size of the call when oslo_messaging.MessagingTimeout is caught, but by the time it eventually scales the chunk size down, the server may already be swamped with calls and will take a considerable time to start bringing routers online.
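For illustration, here is a rough sketch of the chunking logic described above, loosely modelled on fetch_and_sync_all_routers in neutron/agent/l3/agent.py; names, defaults and details are approximate rather than the exact upstream code:

    import oslo_messaging

    # Hardcoded in neutron/agent/l3/agent.py; the values shown here are
    # approximate, check your release.
    SYNC_ROUTERS_MAX_CHUNK_SIZE = 256
    SYNC_ROUTERS_MIN_CHUNK_SIZE = 32

    def sync_all_routers(plugin_rpc, context, process_router, chunk_size):
        """Illustrative resync loop; returns the chunk size to use on the
        next full resync attempt."""
        # Ask the server which routers are scheduled to this agent.
        router_ids = plugin_rpc.get_router_ids(context)
        try:
            for i in range(0, len(router_ids), chunk_size):
                # One sync_routers query per chunk; with a large chunk and a
                # busy neutron-server this call is where the time goes.
                for router in plugin_rpc.get_routers(
                        context, router_ids[i:i + chunk_size]):
                    process_router(router)
        except oslo_messaging.MessagingTimeout:
            # Only after a timeout is the chunk size halved (down to the
            # minimum); the agent then schedules another full resync.
            chunk_size = max(chunk_size // 2, SYNC_ROUTERS_MIN_CHUNK_SIZE)
        return chunk_size

The point is that the chunk size only shrinks after an RPC timeout has already happened, which is exactly when the server is least able to cope with the retries.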

Pre-conditions
==============
We faced this issue in a production environment and managed to reproduce approximately the same behaviour in a pre-production environment.

Details of the test environment:
* 4 instances of the neutron-server
 - 8 RPC workers
 - 8 API workers
 - 700 networks with 1 subnetwork each
 - 100 tenants
 - 9 external networks
 - 1 shared network with instances attached to it
* 6 neutron vpn agents (also tested with neutron-l3-agent)
 - L3 HA configured
 - no l2-population configured
 - 240 routers scheduled per agent
 - rpc_timeout = 600
* 3 nova-compute nodes
 - running 600 instances
 - 100 instances with 2 network interfaces
 - 50 instances attached to the shared network

Observations:
* the sync_routers RPC call takes 7-10 minutes to be processed
* in production we observed messaging timeouts and chunk-size scaling after 40 minutes
* in this environment we did not see an RPC timeout, but the sync_routers call still exceeded the rpc_timeout of 60 seconds and drove the neutron-server to 100% CPU for almost 40 minutes before the chunk size was eventually scaled down and all the routers were fully brought online

Modifications:
We modified neutron/agent/l3/agent.py on the L3 agent nodes and set:
SYNC_ROUTERS_MAX_CHUNK_SIZE = 32
SYNC_ROUTERS_MIN_CHUNK_SIZE = 8
With this change the neutron-l3-agent started creating qrouter-* namespaces within 10 seconds of a clean restart.
A clean restart for this test means killing all keepalived and neutron agent processes, deleting the OVS ports and deleting all namespaces from the node. This effectively forces a full clean resync.

Versions tested:
* stable/mitaka (head)
* 8.4.0 tag
* 8.3.0 tag
I checked the code and the logic is the same in master, so I do not expect much improvement with Newton or Ocata.

I would like to propose making these hardcoded values operator-configurable while keeping the current defaults. This would not change the behaviour of the code for anyone except operators who need to adjust these values, and it would remove the need to carry private patches.
I have a working patch set that I can submit upstream, which should be backportable all the way to mitaka.
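To sketch what the proposal could look like (option names here are hypothetical, not from any merged change), the constants could become agent config options registered with oslo.config, keeping the current values as defaults:

    from oslo_config import cfg

    # Hypothetical option names; defaults would stay at the current
    # hardcoded values so behaviour is unchanged unless an operator
    # overrides them in the agent config file.
    l3_sync_opts = [
        cfg.IntOpt('sync_routers_max_chunk_size', default=256,
                   help='Maximum number of routers requested from '
                        'neutron-server in a single sync_routers call.'),
        cfg.IntOpt('sync_routers_min_chunk_size', default=32,
                   help='Lower bound the chunk size is reduced to after '
                        'repeated sync_routers RPC timeouts.'),
    ]

    cfg.CONF.register_opts(l3_sync_opts)
    # The agent would then read cfg.CONF.sync_routers_max_chunk_size
    # instead of the hardcoded SYNC_ROUTERS_MAX_CHUNK_SIZE constant.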

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: neutron neutron-l3-agent neutron-von-agent rpc slow sync-routers

** Tags added: neutron-l3-agent

** Tags added: neutron-von-agent

** Tags added: rpc slow

** Tags added: sync-routers

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1692971


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1692971/+subscriptions