yahoo-eng-team team mailing list archive
Message #44771
[Bug 1516260] Re: L3 agent sync_routers timeouts may cause cluster to fall down
Reviewed: https://review.openstack.org/234067
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Submitter: Jenkins
Branch: master
commit 0e97feb0f30bc0ef6f4fe041cb41b7aa81042263
Author: Oleg Bondarev <obondarev@xxxxxxxxxxxx>
Date: Tue Oct 13 12:45:59 2015 +0300
L3 agent: paginate sync routers task
When thousands of routers are attached to thousands of networks, a
sync_routers request may take a long time and time out on the agent
side, so the agent initiates another resync. This may lead to an endless
loop that overloads the server and leaves the agent unable to sync state.
This patch makes the L3 agent first check how many routers are assigned
to it and then fetch the routers in chunks.
The initial chunk size is 256, but it may be decreased dynamically if
timeouts occur while waiting for a response from the server.
This approach reduces the load on the server side and speeds up resync
on the agent side, because processing starts right after the first chunk
is received.
Closes-Bug: #1516260
Change-Id: Id675910c2a0b862bfb9e6f4fdaf3cd9fe337e52f
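The chunked fetch described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the names `FetchTimeout`, `sync_routers_in_chunks`, `fetch_chunk` and `demo_fetch` are hypothetical, and halving the chunk size down to a floor of 8 is an assumption (the commit message only says the size "may be decreased dynamically").

```python
class FetchTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout raised by the fetch call."""


def sync_routers_in_chunks(router_ids, fetch_chunk,
                           chunk_size=256, min_chunk_size=8):
    """Yield routers batch by batch, shrinking the batch on timeouts.

    router_ids  -- IDs assigned to this agent (fetched up front)
    fetch_chunk -- callable taking a list of IDs and returning routers;
                   assumed to raise FetchTimeout when the server is slow
    """
    position = 0
    while position < len(router_ids):
        chunk = router_ids[position:position + chunk_size]
        try:
            routers = fetch_chunk(chunk)
        except FetchTimeout:
            if chunk_size <= min_chunk_size:
                # Cannot shrink any further; let the caller handle it.
                raise
            # Assumed policy: halve the chunk size and retry the same span.
            chunk_size = max(chunk_size // 2, min_chunk_size)
            continue
        position += len(chunk)
        # Hand the batch over immediately so processing can start
        # before the remaining chunks have been fetched.
        yield routers


def demo_fetch(ids):
    """Fake server: times out on requests for more than 100 routers."""
    if len(ids) > 100:
        raise FetchTimeout()
    return [{"id": router_id} for router_id in ids]


synced = []
for batch in sync_routers_in_chunks(list(range(1000)), demo_fetch):
    synced.extend(batch)  # a real agent would process each batch here
```

With the fake server above, the first two requests (256 and 128 IDs) time out, and the sketch settles on chunks of 64 for the rest of the sync. Writing the loop as a generator is one way to let the caller start processing after the first chunk arrives.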
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1516260
Title:
L3 agent sync_routers timeouts may cause cluster to fall down
Status in neutron:
Fix Released
Bug description:
The L3 agent sends the 'sync_routers' RPC call when it starts or when
an exception occurs. The call uses the default timeout of 60 seconds (an
oslo.messaging config option). At scale the server can take a long time
to answer, so the call times out and the message is sent again. This
causes a cascading failure that does not resolve itself. The
sync_routers server RPC response was optimized to mitigate this; it
could also be helpful to simply increase the timeout.
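For the timeout increase mentioned above, the relevant oslo.messaging option is rpc_response_timeout in the [DEFAULT] section. A possible fragment, assuming the agent reads its messaging options from neutron.conf and taking 180 purely as an illustrative value:

```ini
[DEFAULT]
# Seconds to wait for a response to an RPC call (default: 60).
# 180 is an example value, not a recommendation from the bug report.
rpc_response_timeout = 180
```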
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1516260/+subscriptions