Re: Design pattern for interruptible processing of continuous data
On Tuesday 04 January 2011 16:24:46 Aaron Bentley wrote:
> > In a previous life, the context data I used for this was a
> > timestamp, and it worked very well in pretty much every case I came
> > across. The client application simply passes the timestamp of the
> > last item it processed to a query/API call, and the data continues
> > to flow from where it left off. This ticked all the boxes for data
> > integrity and for both polling and streaming usage.
>
> Timestamps are an approximation of sequence, because sometimes there are
> multiple rows with the same timestamp. This is not unlikely, because
> the multiple rows may be created as part of a single transaction.
Right, I'd forgotten that the implementation I previously worked on
automatically appended a guaranteed-unique counter to the micro- (or
maybe nano-) seconds part of the timestamp.
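For concreteness, resuming on that composite (timestamp, counter)
cursor might look something like this Python sketch (the table and
column names are invented, and a standard DB-API cursor is assumed):

    # Hypothetical resume query: fetch rows strictly after the saved
    # (timestamp, counter) restart point, in a total order.
    RESUME_SQL = """
        SELECT id, created_at, seq, payload
          FROM events
         WHERE created_at > :ts
            OR (created_at = :ts AND seq > :seq)
         ORDER BY created_at, seq
         LIMIT :batch
    """

    def fetch_next_batch(cur, last_ts, last_seq, batch=100):
        # (created_at, seq) is unique, so "where we left off" is never
        # ambiguous, even when many rows share a timestamp.
        cur.execute(RESUME_SQL,
                    {"ts": last_ts, "seq": last_seq, "batch": batch})
        return cur.fetchall()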
> Because there's room for ambiguity, a process such as you describe could
> 1. refuse to stop while the next row to be processed has the same
> timestamp as the current row.
> 2. stop in the middle and when it starts again, skip the remainder of
> items with the same timestamp.
> 3. stop in the middle and when it starts again, start from the first
> item with that timestamp.
> 4. ?
>
> 1. could work, if it's not essential that we stop immediately.
> 2. is usually undesirable, but can sometimes be fixed up by a second
> cron job that detects that the work still needs to be done.
> 3. could work, if running the operation twice for a given item doesn't
> do any harm. However, we could get stopped again, and again, and
> again, and never finish running the operation on all rows.
Yeah, none of these is really acceptable, but if there's only a single
writer, writing a single record in each transaction, then it will work
as I proposed.
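That is, with a single writer committing one record per transaction,
every committed row's timestamp is strictly later than the previous one
(assuming the clock resolution is fine enough, which is the big
caveat), so a plain comparison is an unambiguous restart point. A
sketch, with the same invented names as above:

    # Single writer, one row per transaction: timestamps are strictly
    # increasing, so ">" alone finds where we left off.
    def fetch_since(cur, last_ts, batch=100):
        cur.execute(
            "SELECT id, created_at, payload FROM events"
            " WHERE created_at > :ts"
            " ORDER BY created_at LIMIT :batch",
            {"ts": last_ts, "batch": batch})
        return cur.fetchall()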
> As I said, timestamps are an approximation of sequence, but we have
> genuine sequences for pretty much every table: integer ID columns. If
> you order the operations by database ID rather than by timestamp, then
> you can record the last ID completed, and there is no room for
> ambiguity. So I think it's simpler to use database IDs rather than
> timestamps.
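(For concreteness, I imagine the ID-based scheme looking something like
this sketch; again the names are invented:

    # Resume by integer ID: "last_id" is the restart point, updated
    # after each successfully processed row.
    def process_from(cur, last_id, handle, batch=100):
        while True:
            cur.execute(
                "SELECT id, payload FROM events"
                " WHERE id > :last ORDER BY id LIMIT :batch",
                {"last": last_id, "batch": batch})
            rows = cur.fetchall()
            if not rows:
                return last_id
            for row_id, payload in rows:
                handle(payload)
                last_id = row_id  # IDs are unique: no ambiguity

How last_id itself gets persisted is the part I care about below.)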
There are a couple of reasons I shied away from IDs:
1. Timestamps are much easier to eyeball than IDs.
2. Simple integer IDs can overflow on a busy system.
> Your idea reminds me of two things:
> 1. DBLoopTuner
> 2. "micro-jobs" from
> https://dev.launchpad.net/Foundations/NewTaskSystem/Requirements
>
> Perhaps those could also provide inspiration?
The idea I want to encapsulate is atomically storing a restart point,
which I can't find expressed in either of these.
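Something like the following sketch is what I have in mind: the restart
point is updated in the same transaction as the work itself, so the two
commit or roll back together. All names are invented, it assumes a
DB-API connection with sqlite-style named parameters, and it only holds
if the work is itself database work on the same connection:

    def process_one(conn, job_name, handle):
        cur = conn.cursor()
        # Assumes a checkpoint row already exists for this job.
        cur.execute("SELECT last_id FROM checkpoints WHERE job = :job",
                    {"job": job_name})
        last_id = cur.fetchone()[0]
        cur.execute(
            "SELECT id, payload FROM events"
            " WHERE id > :last ORDER BY id LIMIT 1",
            {"last": last_id})
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return False          # nothing left to do
        handle(row[1])            # the work (same transaction)
        cur.execute(
            "UPDATE checkpoints SET last_id = :id WHERE job = :job",
            {"id": row[0], "job": job_name})
        conn.commit()             # work + new restart point, atomically
        return True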
Another thing we could do is manually add some microseconds onto the
timestamps, or add a "serial" column, if they get encapsulated in a
different way. I've not thought in any depth about how I'd do that,
though.
Do you think that narrowing the constraints down to a single writer and
a single record per transaction would diminish the usefulness of this
too much? I'm fairly sure it would be OK in the cases I already know
of.
J