
launchpad-dev team mailing list archive

Re: Design pattern for interruptible processing of continuous data

 

On Tuesday 04 January 2011 16:24:46 Aaron Bentley wrote:
> > In a previous life, the context data that I've used for this is a
> > timestamp, and it worked very well in pretty much all cases I came
> > across.  The client application simply provides the same timestamp to a
> > query/API call from the last item it processed, and the data continues
> > to flow from where it left off. This ticked all the boxes for data
> > integrity and polling or streaming usage.
> 
> Timestamps are an approximation of sequence, because sometimes there are
> multiple rows with the same timestamp.  This is not unlikely, because
> the multiple rows may be created as part of a single transaction.

Right, I'd forgotten that the implementation I previously worked on 
automatically appended a guaranteed-unique counter to the microseconds (or 
maybe nanoseconds) part of the timestamp.
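Roughly, the scheme was something like this (a from-memory sketch in 
Python; all names are invented, and it assumes a single writer producing 
fewer than a million rows per second):

    import itertools
    from datetime import datetime, timedelta

    _last_second = None
    _seq = None

    def unique_stamp():
        # Fold a per-second counter into the microseconds so that rows
        # created in the same transaction (or the same clock tick)
        # still sort uniquely.  Single writer only; breaks down beyond
        # a million rows per second.
        global _last_second, _seq
        now = datetime.utcnow().replace(microsecond=0)
        if now != _last_second:
            _last_second, _seq = now, itertools.count()
        return now + timedelta(microseconds=next(_seq))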

> Because there's room for ambiguity, a process such as you describe could
> 1. refuse to stop while the next row to be processed has the same
>    timestamp as the current row.
> 2. stop in the middle and when it starts again, skip the remainder of
>    items with the same timestamp.
> 3. stop in the middle and when it starts again, start from the first
>    item with that timestamp.
> 4. ?
> 
> 1. could work, if it's not essential that we stop immediately.
> 2. is usually undesirable, but can sometimes be fixed up by a second
>    cron job that detects that the work still needs to be done.
> 3. could work, if running the operation twice for a given item doesn't
>    do any harm.  However, we could get stopped again, and again, and
>    again, and never finish running the operation on all rows.

Yeah, none of these is really acceptable, but if there's only a single 
writer, writing a single record in each transaction, then it will work as I 
proposed.

> As I said, timestamps are an approximation of sequence, but we have
> genuine sequences for pretty much every table: integer ID columns.  If
> you order the operations by database ID rather than by timestamp, then
> you can record the last ID completed, and there is no room for
> ambiguity.  So I think it's simpler to use database IDs rather than
> timestamps.
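For concreteness, a minimal sketch of how I read that (table, column, and 
helper names invented; DB-API style against a PostgreSQL-ish driver):

    def process_batch(cur, last_id, batch_size=100):
        # Fetch rows strictly after the last completed ID, in ID
        # order, so the restart point is never ambiguous.
        cur.execute(
            "SELECT id, payload FROM work_item"
            " WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size))
        for row_id, payload in cur.fetchall():
            handle(payload)  # stand-in for the real per-row operation
            last_id = row_id
        return last_id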

That said, there are a couple of reasons I shied away from IDs:
1. Timestamps are far easier to eyeball than opaque IDs.
2. Simple integer IDs can overflow on a busy system.

> Your idea reminds me of two things:
> 1. DBLoopTuner
> 2. "micro-jobs" from
> https://dev.launchpad.net/Foundations/NewTaskSystem/Requirements
> 
> Perhaps those could also provide inspiration?

The idea I want to encapsulate is atomically storing a restart point, which 
I can't find expressed in either of those.
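What I have in mind is roughly this (sketch only, all names invented; the 
cursor could equally be one of the unique timestamps above): do the per-row 
work and advance the stored restart point in the same transaction, so an 
interruption can never leave the two out of step.

    def run_until_interrupted(conn, should_stop):
        cur = conn.cursor()
        while not should_stop():
            # The restart point lives in a one-row table.
            cur.execute("SELECT last_id FROM cursor_position")
            (last_id,) = cur.fetchone()
            cur.execute(
                "SELECT id, payload FROM work_item"
                " WHERE id > %s ORDER BY id LIMIT 1", (last_id,))
            row = cur.fetchone()
            if row is None:
                conn.rollback()
                break  # caught up; poll again later
            row_id, payload = row
            handle(payload)  # stand-in for the real operation
            cur.execute(
                "UPDATE cursor_position SET last_id = %s", (row_id,))
            conn.commit()  # work and restart point commit together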

Another thing we could do is manually add some microseconds to the 
timestamps, or add a "serial" column, if they get encapsulated in a 
different way.  I've not thought in any depth about how I'd do that, though.
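If we did grow a separate column, I'd guess at something as simple as this 
(untested, and again the names are invented):

    cur.execute("ALTER TABLE work_item ADD COLUMN seq BIGSERIAL")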

Do you think that narrowing the constraints to a single writer and a single 
record per transaction would diminish the usefulness of this too much?  I'm 
fairly sure it would be OK in the cases I already know of.

J


