yahoo-eng-team team mailing list archive

[Bug 2074209] [NEW] OVN maintenance tasks may be delayed 10 minutes in the podified deployment

 

Public bug reported:

When running Neutron server on a K8s (or OpenShift) cluster, OVN
maintenance periodic tasks that are supposed to run immediately may be
delayed by about 10 minutes. This happens, for example, when Neutron's
configuration is changed and K8s restarts the neutron pods. The
sequence in that case is:

1. Pods with the neutron-api application are running.
2. The configuration is updated; K8s first starts new pods and, once the new ones are ready, terminates the old pods.
3. During that time, the neutron-server process running in the new pod starts its maintenance task and immediately tries to run the tasks defined with the "periodics.periodic(spacing=600, run_immediately=True)" decorator.
4. The new pod does not yet hold the lock on the OVN northbound DB, so each of these maintenance tasks stops immediately.
5. A few seconds later the OLD neutron-server pod is terminated by K8s, and the new pod (the one started in point 3) gets the lock on the OVN database.
6. The maintenance worker runs all maintenance tasks again only after the time defined in the "spacing" parameter, which is 600 seconds. That is a long time to wait for, e.g., options in the OVN database to be adjusted to the new Neutron configuration.
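
The timing in points 3 to 6 can be illustrated with a small sketch. It
assumes a simplified model in which the task runs at t=0 and then every
"spacing" seconds, and only a run that happens once the lock is held
does real work. The function name and the example lock time of 10
seconds are hypothetical; this is not Neutron code.

```python
def first_effective_run(spacing, lock_acquired_at):
    """Return the simulated time (seconds) of the first run that holds the lock.

    Simplified model (assumption): runs happen at 0, spacing, 2*spacing, ...
    and a run is effective only if the lock was acquired at or before it.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    # Integer ceiling division: number of whole periods needed to reach
    # or pass the moment the lock was acquired.
    periods = -(-lock_acquired_at // spacing)
    return periods * spacing


# The lock is granted 10 seconds after the new pod starts (example value).
delay_now = first_effective_run(spacing=600, lock_acquired_at=10)
delay_proposed = first_effective_run(spacing=5, lock_acquired_at=10)
print(delay_now, delay_proposed)
```

Under this model the first effective run lands a full period after
startup with spacing=600, while with spacing=5 it lands on the first
tick at or after the moment the lock is granted.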

We could reduce this spacing to e.g. 5 seconds, which would
significantly decrease the additional waiting time in the case
described in this bug. It would also cause all those methods to be
called much more often in neutron-server processes that have not been
granted the lock, but we could introduce an additional parameter for
that and e.g. raise NeverAgain() after 100 attempts to run such a
periodic task.
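
A minimal sketch of that idea follows, as a self-contained simulation
rather than a patch. The NeverAgain class here is a local stand-in for
futurist's periodics.NeverAgain, and make_limited_task, simulate and
the has_lock callback are hypothetical names used only for
illustration.

```python
class NeverAgain(Exception):
    """Local stand-in for futurist.periodics.NeverAgain (assumption)."""


def make_limited_task(work, has_lock, max_attempts=100):
    """Wrap `work` so it gives up after `max_attempts` runs without the lock."""
    state = {"attempts": 0}

    def task(now):
        if has_lock(now):
            work()
            return True
        # No lock yet: count the attempt and stop for good at the limit.
        state["attempts"] += 1
        if state["attempts"] >= max_attempts:
            raise NeverAgain("no lock after %d attempts" % state["attempts"])
        return False

    return task


def simulate(task, spacing, max_time):
    """Call `task` every `spacing` simulated seconds; return (outcome, time)."""
    now = 0
    while now <= max_time:
        try:
            if task(now):
                return ("done", now)
        except NeverAgain:
            return ("gave_up", now)
        now += spacing
    return ("timeout", now)


# New pod: the lock arrives 10 simulated seconds after startup.
new_pod = make_limited_task(lambda: None, lambda now: now >= 10)
# A process that is never granted the lock stops after 100 attempts.
no_lock = make_limited_task(lambda: None, lambda now: False)

print(simulate(new_pod, spacing=5, max_time=1000))
print(simulate(no_lock, spacing=5, max_time=1000))
```

In this simulation the new pod does its work as soon as the lock is
available, and a process that never gets the lock stops retrying after
100 attempts, i.e. after a bit more than 8 minutes at a 5 second
spacing.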

** Affects: neutron
     Importance: Medium
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2074209

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2074209/+subscriptions


