
launchpad-dev team mailing list archive

Design pattern for interruptible processing of continuous data

 

Dear all,

I've seen this problem pop up in similar forms a few times now: a cron job 
processes a bunch of data (whether externally over the API or internally), 
needs to do a batch of work, remember where it left off (because it hit a 
batch limit or the live data paused), and continue later.

Typically, to solve this, the client processing the data stores some piece of 
context about where it got to, and uses that context to restart from the 
right place next time.

I think it would be a good idea to formalise a design for this, in a way that 
will also benefit us when we eventually start using a message-queuing 
application.

In a previous life, the piece of context I used for this was a timestamp, and 
it worked well in pretty much every case I came across.  The client 
application simply passes the timestamp of the last item it processed to the 
query/API call, and the data continues to flow from where it left off.  This 
ticked all the boxes for data integrity, for both polling and streaming usage.
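To make the pattern concrete, here is a minimal Python sketch of the polling 
loop.  The data source, the `fetch_since` call, and the batch size are all 
invented for illustration; the point is only that the timestamp of the last 
processed item is the sole piece of state carried between runs:

```python
from datetime import datetime, timedelta

# Simulated data source: (timestamp, payload) pairs, ordered by timestamp.
BASE = datetime(2011, 1, 1)
ITEMS = [(BASE + timedelta(minutes=i), "item-%d" % i) for i in range(10)]

def fetch_since(cursor, limit):
    """Return up to `limit` items strictly newer than `cursor`."""
    newer = [item for item in ITEMS if item[0] > cursor]
    return newer[:limit]

def run_batch(cursor, limit=4):
    """Process one batch; return the new cursor (last timestamp handled)."""
    processed = []
    for ts, payload in fetch_since(cursor, limit):
        processed.append(payload)   # real work would happen here
        cursor = ts                 # remember the last item we got through
    return cursor, processed

# The first run processes one batch; a later run (e.g. the next cron
# invocation) resumes from the saved cursor with no overlap and no gaps.
cursor = datetime.min
cursor, first = run_batch(cursor)
cursor, second = run_batch(cursor)
```

Whether the cursor comparison is strict (`>`) or inclusive depends on whether 
two records can share a timestamp; with a strict comparison, ties would need a 
secondary key to avoid skipping records.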

We would need to store this context somewhere of course, and I am proposing 
that we create a new generic table for this, along the lines of:

CREATE TABLE DataTimestamps (
    -- One row per client: the primary key prevents duplicate entries.
    name TEXT NOT NULL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL
);

where "name" identifies the client app that's consuming the data, and 
"timestamp" is the timestamp of the last item it processed.

Each data record would, of course, also need to carry a timestamp.  Most of 
our data already has this.
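A sketch of how a client might read and update its row, using Python's 
sqlite3 module purely for illustration (the helper names `save_checkpoint` 
and `load_checkpoint`, the client name, and the epoch default are all mine, 
not an existing API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DataTimestamps (
        name TEXT NOT NULL PRIMARY KEY,
        timestamp TIMESTAMP NOT NULL
    )
""")

def save_checkpoint(conn, name, ts):
    # Insert or overwrite this client's last-processed timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO DataTimestamps (name, timestamp) "
        "VALUES (?, ?)", (name, ts))
    conn.commit()

def load_checkpoint(conn, name, default="1970-01-01 00:00:00"):
    # A client that has never run starts from the default (the epoch here).
    row = conn.execute(
        "SELECT timestamp FROM DataTimestamps WHERE name = ?",
        (name,)).fetchone()
    return row[0] if row else default

save_checkpoint(conn, "derived-distros", "2011-05-17 12:00:00")
```

The upsert keeps the table at one row per client; on PostgreSQL the same 
effect would need an UPDATE-then-INSERT or equivalent, since `INSERT OR 
REPLACE` is SQLite-specific.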

This will be immediately useful in the Derived Distros feature that my team 
is working on, so I'm keen to get it sorted out quickly.

All constructive comments welcome.

Cheers
J
