
launchpad-dev team mailing list archive

Re: Design pattern for interruptible processing of continuous data

 

On Tuesday 04 January 2011 22:39:19 Aaron Bentley wrote:
> On 11-01-04 12:29 PM, Julian Edwards wrote:
> > Yeah, none of these are acceptable really, but if there's only a single
> > writer, writing single records in each transaction, then it will work as
> > I proposed.
> 
> Earlier, you said:
> > The timestamp would also need to live on each context record as well, of
> > course.  Most of our data already has this.
> 
> So I assumed that you intended to use our existing data as the context
> records.  Am I mistaken?

I just meant that we already have a timestamp field that we could use in the 
future.  I don't care too much about the data that's already there.

> > 2. Simple integer IDs can overflow on a busy system
> 
> I'm not sure what you mean.  I'm sure we both agree that there are
> maximum values that an ID can have, depending on its integer type.

[snip stuff about bigint]

I think you're probably right here; my thoughts were clouded by past 
experiences.

> > The idea that I want to encapsulate is the concept of atomically storing
> > a restart-point, which I can't find expressed in either of these.
> 
> Sure, but that's not the only way to solve your use cases.  These both
> provide design patterns that could be used for interruptible processing
> of continuous data.
> 
> DBLoopTuner relies on the TuneableLoop's __call__() method to store the
> restart-point.  So for example,
> lp.translations.scripts.verify_pofile_stats.Verifier uses self.start_id
> as the restart-point.  Your idea is similar to a TuneableLoop, except
> that you want to store the restart point, and you want it to be
> explicitly a timestamp instead of having it be an implementation detail.

The TuneableLoop stuff does not provide any mechanism to store restart points 
itself; it relies on the code that inherits from it to do so.
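To be concrete, this is the shape of the pattern as I understand it, modelled 
loosely on the Verifier you mention but with everything here invented for 
illustration.  The point is that the restart point only ever lives in memory:

    class Verifier:
        """Tunable-loop style worker: the restart point is in-memory only."""

        def __init__(self, rows):
            self.rows = sorted(rows)   # stand-in for the real table
            self.start_id = 0          # lost if the process dies mid-run

        def isDone(self):
            return all(row <= self.start_id for row in self.rows)

        def __call__(self, chunk_size):
            chunk = [r for r in self.rows if r > self.start_id][:chunk_size]
            for row in chunk:
                print("processing %s" % row)   # stand-in for the real work
            if chunk:
                self.start_id = chunk[-1]
            # Nothing persists start_id; a crash starts over from zero.

    loop = Verifier([1, 2, 5, 9])
    while not loop.isDone():
        loop(chunk_size=2)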

I want something that will store these restart points atomically with the 
operations they track, and preferably in such a way that I don't have to 
think about it too much.
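Roughly this shape, in other words.  I'm using raw psycopg2 just for 
illustration (in reality it would go through the store), and the table, 
column, and helper names are all invented:

    import psycopg2

    def handle(payload):
        pass  # stand-in for the real processing

    def process_batch(conn, name, batch_size=100):
        cur = conn.cursor()
        # Lock our restart point row (assumed to exist already); this
        # also enforces a single writer per consumer name.
        cur.execute(
            "SELECT sequence FROM restart_point"
            " WHERE name = %s FOR UPDATE", (name,))
        (last,) = cur.fetchone()
        cur.execute(
            "SELECT id, payload FROM incoming"
            " WHERE id > %s ORDER BY id LIMIT %s", (last, batch_size))
        rows = cur.fetchall()
        for row_id, payload in rows:
            handle(payload)
        if rows:
            cur.execute(
                "UPDATE restart_point SET sequence = %s WHERE name = %s",
                (rows[-1][0], name))
        # A single commit covers both the work and the new restart point,
        # so the two can never disagree.
        conn.commit()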

> To apply "micro-jobs" to this problem, you would represent each
> operation as a "micro-job".  You would directly represent which jobs had
> been run and which ones had not.  The specifics depend on how we end up
> implementing the new task system, but one obvious way would be to have a
> status like BuildStatus for each micro-job.

Micro-jobs are a nice idea, but they're orthogonal to what I want to do here.  
They might even end up using this design.
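For the archives, here's my reading of the idea in code form.  Everything 
here is invented, with the statuses standing in for something 
BuildStatus-like:

    WAITING, DONE, FAILED = "WAITING", "DONE", "FAILED"

    class MicroJob:
        """One record per operation, with an explicit status."""

        def __init__(self, payload):
            self.payload = payload
            self.status = WAITING

    def run_pending(jobs):
        # Restartability falls out of the data model: just pick up
        # whatever isn't DONE yet; no separate restart point is needed.
        for job in jobs:
            if job.status == DONE:
                continue
            try:
                print("processing %s" % job.payload)  # stand-in for real work
                job.status = DONE
            except Exception:
                job.status = FAILED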

> I guess you mean changing the microseconds on the timestamps to ensure
> they are unique?  That does not guarantee uniqueness unless you have a
> single writer that does not run more than once per microsecond.
> However, if we were really busy, that would be too slow.

True enough.

> I think that if we need an ordered list of unique identifiers, then it's
> much simpler to use integer IDs than timestamps.

I think you're right, basically because of what we get from Postgres.  My 
previous experience was with an in-house DB solution that did all of this for 
you, and it's clouded my thoughts a bit (along with the manflu!).

So if we have a table with (name, sequence) columns, is there anything else to 
be concerned with?
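To make that concrete, something like this is what I'm picturing (bigint per 
your earlier point; the table name is invented):

    def create_restart_table(conn):
        # One row per consumer; sequence is the id of the last record
        # successfully processed.  bigint so that a busy system can't
        # realistically overflow it.
        cur = conn.cursor()
        cur.execute("""
            CREATE TABLE restart_point (
                name     text PRIMARY KEY,
                sequence bigint NOT NULL DEFAULT 0
            )
        """)
        conn.commit()

Each consumer would SELECT ... FOR UPDATE its own row at the start of a 
batch, which conveniently also enforces the single-writer assumption.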

Cheers.


