credativ team mailing list archive

Thread
Date
Re: [Bug 677257] Re: Scheduler won't reschedule a task if it takes too long

To: credativ@xxxxxxxxxxxxxxxxxxx
From: "Olivier Dony \(OpenERP\)" <677257@xxxxxxxxxxxxxxxxxx>
Date: Mon, 19 Dec 2011 10:39:25 -0000
Reply-to: Bug 677257 <677257@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
On 12/19/2011 10:25 AM, Ghislain Nebra (INCB) wrote:
> I would also add that in _poolJos function, sql requests are using
> "now()" of Postgre whereas in the if statement, the "DateTime.now()"
> of Python is used. Only one "now" origin should be used : the Python
> one.
>
> Here is my suggestion :
> cr.execute("select * from ir_cron where numbercall<>0 and active
>             and nextcall<=now() order by priority")
> should be
> cr.execute("select * from ir_cron where numbercall<>0 and active
>             and nextcall<='" + now.strftime('%Y-%m-%d %H:%M:%S')
>             + "' order by priority")
>
> This is very important if your PostgreSQL database is not on the same
> computer than the OpenERP server because a small difference in clock
> could lead to non-working cron jobs.


I agree that it would be more consistent to use the same reference time
everywhere in the code. Practically however the time offset should be
minimal because as of 6.1 we have switched the server to use UTC
everywhere, and we're storing only UTC timestamps in the database.
So the only difference here would be a clock sync difference between the
database machine and the OpenERP machine, as you say.

As of 6.1 the cron mechanism is multi-threaded and supports
load-balancing on multiple OpenERP servers talking to the same database.
In such environments it will be important to have all machines properly
time-sync'ed, e.g. with NTP, because there's a limit to what magic the
system can do to avoid sync issues.

In a load-balancing configuration there is always a way for things to go
wrong if all the servers are not properly sync'ed. Imagine a 1 hour
offset between the machines (while all of them think they're all at
UTC!): you'll get different results depending on which machine runs the
job vs. which was used to configure its execution time.

So if you can come up with a nice patch to improve this logic, feel free
to propose a merge for trunk (see the doc[1]), but keep in mind that
there's a limit to what we can do in case of bad time sync.
In time-critical deployment environments, setting up a proper clock sync
across all machines seems like a pre-requisite.

Note about the patch:
- you should pass the time value as a query parameter instead of
mangling the query string (bad practice!)
- you need to study the latest trunk code and make a patch against it,
such changes won't be accepted in a stable branch (this is a minor
improvement that can be replaced by proper config and clock sync)


> Moreover in this function, the "while nextcall < now and numbercall:"
> should be "while nextcall <= now and numbercall:" because if the cron
> job is planned at 1:00, the timer will wake up at 1:00 ... and you want
> your cron job to be executed !

Actually this will never be a problem because even if the job execution
was started exactly on the second where it was scheduled to run, the
value of datetime.now() includes precision up to the microsecond. It is
compared with the nextcall value from the database, which is rounded to
the second, so you'll get such a comparison:
 2011-12-19 10:33:07 < 2011-12-19 10:33:07.290877

Granted, changing the comparison to `<=` would not hurt (it's only here
to stop repeating calls when nextcall is in the future), but it won't
cause any issue here. If you make a merge proposal it should be accepted
if you change that operator too, if you like.

Thanks!

[1] http://bit.ly/openerp-contrib-mp

-- 
You received this bug notification because you are a member of OpenERP
Framework Experts, which is subscribed to OpenERP Server.
https://bugs.launchpad.net/bugs/677257

Title:
  Scheduler won't reschedule a task if it takes too long

Status in OpenERP Server:
  Fix Released
Status in OpenERP Server 5.0 series:
  Fix Released

Bug description:
  We ran into this problem because we were running the mrp scheduler
  every two minutes and it started to take longer than a minute to run.
  Suddenly, it would just stop being scheduled.

  It looks like this is what happens in the ir_cron._poolJobs() method:

  1. Get the current time and hold it in the "now" variable.
  2. Find all active jobs whose next call time has passed.
  3. Run each job.
  4. Increment the next call time by the job's interval until it passes the "now" variable which may be a few minutes in the past if the jobs took a while to execute.
  5. Update the job's next call time in the database.
  6. Find all active jobs whose next call time is in the future and schedule the first one.

  If a job ends up getting scheduled for a time after the "now" variable
  but before step 6 executes then it will no longer be executed.

  Here's a scenario where that could happen. The mrp scheduler is
  scheduled to run every five minutes and it takes two minutes to run.

  10:00 mrp starts
  10:02 mrp finishes, scheduled for 10:05
  10:03 admin shuts down the server for maintenance
  10:09 admin starts up the server and connects to database. mrp starts
  10:11 mrp finishes, scheduled for 10:10 (10:05 plus 5 minutes, it's after 10:09). mrp is no longer in the list of tasks in memory.

  Once this happens, I think there is a two-in-five chance that the mrp
  task will run once at start up and not be scheduled after that.

  Why does step six above need to check that the next time is in the
  future? Shouldn't it just schedule the minimum time for all active
  tasks?

To manage notifications about this bug go to:
https://bugs.launchpad.net/openobject-server/+bug/677257/+subscriptions
References

[Bug 677257] Re: Scheduler won't reschedule a task if it takes too long
From: Ghislain Nebra (INCB), 2011-12-19