yahoo-eng-team team mailing list archive

[Bug 2074209] Re: OVN maintenance tasks may be delayed 10 minutes in the podified deployment

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <2074209@xxxxxxxxxxxxxxxxxx>
Date: Fri, 02 Aug 2024 10:37:34 -0000
Reply-to: Bug 2074209 <2074209@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/925194
Committed: https://opendev.org/openstack/neutron/commit/04c217bcd0eda07d52a60121b6f86236ba6e26ee
Submitter: "Zuul (22348)"
Branch:    master

commit 04c217bcd0eda07d52a60121b6f86236ba6e26ee
Author: Slawek Kaplonski <skaplons@xxxxxxxxxx>
Date:   Tue Jul 30 14:17:44 2024 +0200

    Lower spacing time of the OVN maintenance tasks which should be run once
    
    Some of the OVN maintenance tasks are expected to run just once and
    then raise periodics.NeverAgain() so that they are not run again. Those
    tasks also require the OVN DB lock to be acquired, so that only one of
    the maintenance workers really runs them.
    All of those tasks had a spacing time of 600 seconds, so they were
    attempted every 600 seconds. This usually works fine, but it may cause
    a small issue in environments where Neutron runs in a POD as a
    k8s/openshift application. In such a case, when e.g. the configuration
    of neutron is updated, it may happen that a new POD with Neutron is
    spawned first and k8s stops the old POD only once the new one is
    already running. Because of that, the maintenance worker running in the
    new neutron-server POD will not acquire the lock on the OVN DB (the old
    POD still holds it) and will not run those maintenance tasks
    immediately. After the old POD is terminated, one of the new PODs will
    at some point acquire that lock and then run all those maintenance
    tasks, but only after a delay of about 600 seconds.
    
    To avoid such a long wait before those maintenance tasks run, this
    patch lowers their spacing time from 600 to just 5 seconds.
    Additionally, maintenance tasks which are supposed to be run only once
    and only by the maintenance worker which has acquired the OVN DB lock
    will now be stopped (periodics.NeverAgain will be raised) after 100
    attempts to run.
    This avoids running them every 5 seconds forever on the workers
    which never acquire the lock on the OVN DB.
    
    Closes-bug: #2074209
    Change-Id: Iabb4bb427588c1a5da27a5d313f75b5bd23805b2


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2074209

Title:
  OVN maintenance tasks may be delayed 10 minutes in the podified
  deployment

Status in neutron:
  Fix Released

Bug description:
  When running Neutron server on a K8s (or OpenShift) cluster, it may
  happen that OVN maintenance periodic tasks which are supposed to be
  run immediately are delayed by about 10 minutes. This happens when e.g.
  Neutron's configuration is changed and K8s restarts the neutron pods.
  What happens in such a case is:

  1. pods with the neutron-api application are running,
  2. the configuration is updated; k8s first starts new pods and, once the new ones are ready, terminates the old pods,
  3. during that time, the neutron-server process running in the new pod starts its maintenance worker, which immediately tries to run the tasks defined with the "periodics.periodic(spacing=600, run_immediately=True)" decorator,
  4. this new pod doesn't yet hold the lock on the OVN northbound DB, so each of those maintenance tasks exits immediately without doing its work,
  5. a few seconds later the OLD neutron-server pod is terminated by k8s and the new pod (the one started in point 3) gets the lock on the OVN database,
  6. the maintenance worker now runs all the maintenance tasks again only after the time defined in the "spacing" parameter, which is 600 seconds. These 600 seconds are a pretty long time to wait for e.g. some options in the OVN database to be adjusted to the new Neutron configuration.
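
  The timing in the steps above can be illustrated with a small,
  simplified model. It is only a sketch: it assumes attempts happen
  exactly at multiples of the spacing, starting at time 0 because of
  run_immediately=True, and the 10 second lock hand-over is a
  hypothetical stand-in for "a few seconds later". The function name is
  made up for this illustration and is not part of Neutron.

```python
def first_run_with_lock(spacing, lock_acquired_at):
    """Return the time (seconds since worker start) of the first attempt
    that happens once the worker holds the OVN DB lock.

    Simplified model: attempts occur at 0, spacing, 2 * spacing, ...
    """
    attempt_time = 0  # run_immediately=True: first attempt at start-up
    while attempt_time < lock_acquired_at:
        # Lock not held yet: the task does nothing and waits for the
        # next scheduled attempt.
        attempt_time += spacing
    return attempt_time


# Hypothetical hand-over time: the old pod releases the lock after 10 s.
LOCK_ACQUIRED_AT = 10

delay_with_600 = first_run_with_lock(600, LOCK_ACQUIRED_AT)
delay_with_5 = first_run_with_lock(5, LOCK_ACQUIRED_AT)
print(delay_with_600, delay_with_5)  # prints "600 10"
```

  Under these assumptions, a spacing of 600 seconds makes the tasks do
  their work 600 seconds after start-up, while a spacing of 5 seconds
  makes them run at the first attempt after the lock is acquired.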

  We could reduce this spacing time to e.g. 5 seconds. This would
  decrease the additional waiting time significantly in the case
  described in this bug. It would also cause all those methods to be
  called much more often in neutron-server processes which don't hold
  the lock, but we may introduce an additional parameter for that and
  e.g. raise NeverAgain() after 100 attempts to run such a periodic task.
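
  The proposal could look roughly like the sketch below. It is not the
  Neutron implementation: NeverAgain here is a local stand-in for the
  exception of the same name in futurist's periodics module, and the
  RunOnceTask class, its has_lock callable and its max_attempts
  parameter are hypothetical names chosen for illustration. Whether the
  limit is checked before or after the 100th attempt is also an
  assumption.

```python
class NeverAgain(Exception):
    """Local stand-in for futurist's periodics.NeverAgain (assumption)."""


class RunOnceTask:
    """Run `work` once while the lock is held; give up after a number
    of attempts made without the lock."""

    def __init__(self, work, has_lock, max_attempts=100):
        self._work = work
        self._has_lock = has_lock          # callable returning True/False
        self._max_attempts = max_attempts
        self._attempts = 0

    def __call__(self):
        self._attempts += 1
        if self._has_lock():
            self._work()
            # Work is done: tell the scheduler not to call us again.
            raise NeverAgain("task completed")
        if self._attempts >= self._max_attempts:
            # This worker never got the lock: stop retrying every few
            # seconds forever.
            raise NeverAgain("gave up without acquiring the lock")
        # Otherwise return normally so the scheduler retries after
        # the configured spacing.
```

  With a spacing of 5 seconds and a limit of 100 attempts, a worker
  would keep trying for roughly 500 seconds before giving up, which
  bounds the extra load on workers that never hold the lock.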

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2074209/+subscriptions

References

[Bug 2074209] [NEW] OVN maintenance tasks may be delayed 10 minutes in the podified deployment
From: Slawek Kaplonski, 2024-07-26