yahoo-eng-team team mailing list archive

[Bug 1516260] Re: L3 agent sync_routers timeouts may cause cluster to fall down

 

Reviewed:  https://review.openstack.org/234067
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Submitter: Jenkins
Branch:    master

commit 0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Author: Oleg Bondarev <obondarev@xxxxxxxxxxxx>
Date:   Tue Oct 13 12:45:59 2015 +0300

    L3 agent: paginate sync routers task
    
    When thousands of routers are attached to thousands of networks, a
    sync_routers request may take long enough to time out on the agent
    side, so the agent initiates another resync. This can lead to an
    endless loop that overloads the server and leaves the agent unable
    to sync its state.
    
    This patch makes the L3 agent first check how many routers are
    assigned to it and then fetch them in chunks. The initial chunk size
    is 256, but it may be decreased dynamically if timeouts occur while
    waiting for a response from the server.
    
    This approach reduces the load on the server side and speeds up
    resync on the agent side, because processing starts right after the
    first chunk is received.
    
    Closes-Bug: #1516260
    Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
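
The chunked fetch described in the commit message could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the patch: the names sync_routers_in_chunks, SyncTimeout, get_router_ids, get_routers and process are hypothetical, the minimum chunk size of 32 is assumed (the commit message only gives the initial size of 256), and halving the chunk and retrying the same chunk is one possible way to "decrease dynamically".

```python
class SyncTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout raised by the server call."""


INITIAL_CHUNK_SIZE = 256  # initial chunk size named in the commit message
MIN_CHUNK_SIZE = 32       # assumed lower bound, not stated in the source


def sync_routers_in_chunks(get_router_ids, get_routers, process,
                           chunk_size=INITIAL_CHUNK_SIZE,
                           min_chunk_size=MIN_CHUNK_SIZE):
    """Fetch routers chunk by chunk, shrinking the chunk on timeouts.

    get_router_ids() returns the IDs of all routers assigned to the agent.
    get_routers(ids) returns the router records for the given IDs and may
    raise SyncTimeout. process(router) handles a single router record.
    Returns the chunk size in effect when the sync finished.
    """
    # Cheap first call: learn how many routers there are before fetching.
    router_ids = get_router_ids()
    pos = 0
    while pos < len(router_ids):
        chunk = router_ids[pos:pos + chunk_size]
        try:
            routers = get_routers(chunk)
        except SyncTimeout:
            if chunk_size <= min_chunk_size:
                # Cannot shrink further; let the caller decide what to do.
                raise
            # Halve the chunk (never below the minimum) and retry.
            chunk_size = max(chunk_size // 2, min_chunk_size)
            continue
        # Start processing as soon as a chunk arrives.
        for router in routers:
            process(router)
        pos += len(chunk)
    return chunk_size
```

Because each request is bounded in size, a single slow response no longer forces the whole sync to restart, and the agent can begin configuring routers before the full list has been fetched.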


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1516260

Title:
  L3 agent sync_routers timeouts may cause cluster to fall down

Status in neutron:
  Fix Released

Bug description:
  The L3 agent sends a 'sync_routers' RPC call when it starts or when
  an exception occurs. The call uses a default timeout of 60 seconds
  (an Oslo messaging config option). At scale the server can take
  longer than that to answer, so the call times out and the message
  is sent again. The repeated requests cause a cascading failure that
  does not resolve itself. The sync_routers server RPC response was
  optimized to mitigate this; it could also help to simply increase
  the timeout.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1516260/+subscriptions

