yahoo-eng-team team mailing list archive
Message #94365
[Bug 2074209] Re: OVN maintenance tasks may be delayed 10 minutes in the podified deployment
Reviewed: https://review.opendev.org/c/openstack/neutron/+/925194
Committed: https://opendev.org/openstack/neutron/commit/04c217bcd0eda07d52a60121b6f86236ba6e26ee
Submitter: "Zuul (22348)"
Branch: master
commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee
Author: Slawek Kaplonski <skaplons@xxxxxxxxxx>
Date: Tue Jul 30 14:17:44 2024 +0200
Lower spacing time of the OVN maintenance tasks which should be run once
Some of the OVN maintenance tasks are expected to run just once and
then raise periodics.NeverAgain() so that they are not run again. Those
tasks also require the OVN DB lock to be acquired, so that only one of
the maintenance workers actually runs them.
All of those tasks had a spacing time of 600 seconds, so they were run
every 600 seconds. This usually works fine, but it may cause a small
issue in environments where Neutron runs in a pod as a k8s/OpenShift
application. In that case, when e.g. the Neutron configuration is
updated, a new pod with Neutron may be spawned first, and k8s stops the
old pod only once the new one is already running. Because of that, the
maintenance worker running in the new neutron-server pod will not
acquire the lock on the OVN DB (the old pod still holds it) and will not
run those maintenance tasks immediately. After the old pod is
terminated, one of the new pods will at some point acquire the lock and
then run all of those maintenance tasks, but only after a delay of up to
600 seconds.
To avoid such a long wait before those maintenance tasks run, this
patch lowers their spacing time from 600 to just 5 seconds.
Additionally, maintenance tasks which are supposed to be run only once,
and only by the maintenance worker which has acquired the OVN DB lock,
will now be stopped (periodics.NeverAgain will be raised) after 100
attempts to run them.
This avoids running them every 5 seconds forever on the workers
which never acquire the lock on the OVN DB.
Closes-bug: #2074209
Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2074209
Title:
OVN maintenance tasks may be delayed 10 minutes in the podified
deployment
Status in neutron:
Fix Released
Bug description:
When running Neutron server on a K8s (or OpenShift) cluster, it may
happen that OVN maintenance periodic tasks which are supposed to be
run immediately are delayed by about 10 minutes. This happens when e.g.
Neutron's configuration is changed and K8s restarts the neutron pods.
What happens in such a case is:
1. pods with the neutron-api application are running,
2. the configuration is updated; k8s first starts new pods and, once the new ones are ready, terminates the old pods,
3. during that time, the neutron-server process running in the new pod starts the maintenance worker, which immediately tries to run the tasks defined with the "periodics.periodic(spacing=600, run_immediately=True)" decorator,
4. the new pod doesn't yet hold the lock on the OVN northbound DB, so each of those maintenance tasks returns immediately without doing its work,
5. a few seconds later the OLD neutron-server pod is terminated by k8s and the new pod (the one started in point 3) gets the lock on the OVN database,
6. the maintenance worker runs all of the maintenance tasks again only after the time defined in the "spacing" parameter, which is 600 seconds. That is a long time to wait for e.g. some options in the OVN database to be adjusted to the new Neutron configuration.
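The timing in the steps above can be illustrated with a small, idealized model. This is only a sketch: it assumes attempts happen exactly at t = 0, spacing, 2 * spacing, ... seconds after the worker starts, which is a simplification of how a real periodic scheduler behaves, and the function name first_run_with_lock is hypothetical (it is not part of Neutron).

```python
import math


def first_run_with_lock(spacing, lock_acquired_at):
    """Return when the first attempt that holds the lock happens.

    Both arguments are in seconds, measured from the start of the new
    maintenance worker. Idealized model (an assumption, not the real
    scheduler): attempts happen at t = 0, spacing, 2 * spacing, ...
    """
    if lock_acquired_at <= 0:
        # The lock was already held at the first, immediate attempt.
        return 0
    # First multiple of `spacing` that is not earlier than the lock time.
    return math.ceil(lock_acquired_at / spacing) * spacing


# Example: the old pod is terminated 10 seconds after the new one starts.
delay_with_600 = first_run_with_lock(600, 10) - 10
delay_with_5 = first_run_with_lock(5, 10) - 10
print(delay_with_600, delay_with_5)
```

In this model, the extra wait after the lock is acquired is bounded by the spacing value, which is why a 600 second spacing can add close to 10 minutes and a 5 second spacing adds at most 5 seconds.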
We could reduce this spacing time to e.g. 5 seconds. This would
significantly decrease the additional waiting time in the case
described in this bug. It would also cause all of those methods to be called
much more often in neutron-server processes which have not been granted
the lock, but we may introduce an additional parameter for that and e.g.
raise NeverAgain() after 100 attempts to run such a periodic task.
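A minimal sketch of that idea follows. It is not the code from the Neutron patch: the decorator name run_once_with_lock, the Worker class, and the local NeverAgain exception are all hypothetical stand-ins so that the example runs without any scheduler library. In a real deployment the exception would be the scheduler's own "never again" signal, and the periodic scheduler would call the method every "spacing" seconds.

```python
import functools


class NeverAgain(Exception):
    """Local stand-in for the scheduler's 'do not reschedule' signal."""


def run_once_with_lock(max_attempts=100):
    """Hypothetical decorator for tasks that must run once, with the lock.

    Without the lock the task is skipped, and after `max_attempts`
    skipped attempts it gives up by raising NeverAgain. With the lock
    the task runs once and then raises NeverAgain as well.
    """
    def wrapper(func):
        counter_attr = "_attempts_" + func.__name__

        @functools.wraps(func)
        def decorated(self, *args, **kwargs):
            if not self.has_lock:
                # Count skipped attempts per worker instance.
                attempts = getattr(self, counter_attr, 0) + 1
                setattr(self, counter_attr, attempts)
                if attempts >= max_attempts:
                    raise NeverAgain()
                return None
            func(self, *args, **kwargs)
            # The work is done, so the task should not be scheduled again.
            raise NeverAgain()
        return decorated
    return wrapper


class Worker:
    """Hypothetical maintenance worker, used only for this example."""

    def __init__(self, has_lock):
        self.has_lock = has_lock
        self.ran = False

    @run_once_with_lock(max_attempts=3)
    def adjust_options(self):
        self.ran = True
```

With a spacing of 5 seconds and a limit of 100 attempts, a worker that never gets the lock would stop retrying after roughly 500 seconds, while the worker that does get the lock runs the task within a few seconds of acquiring it.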
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2074209/+subscriptions