launchpad-dev team mailing list archive

Re: Design pattern for interruptible processing of continuous data

To: Julian Edwards <julian.edwards@xxxxxxxxxxxxx>
From: Aaron Bentley <aaron@xxxxxxxxxxxxx>
Date: Tue, 04 Jan 2011 17:39:19 -0500
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <201101041729.17568.julian.edwards@canonical.com>
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101208 Thunderbird/3.1.7

On 11-01-04 12:29 PM, Julian Edwards wrote:
> Yeah, none of these are acceptable really, but if there's only a single 
> writer, writing single records in each transaction, then it will work as I 
> proposed.

Earlier, you said:
> The timestamp would also need to live on each context record as well, of
> course.  Most of our data already has this.

So I assumed that you intended to use our existing data as the context
records.  Am I mistaken?

If we do use our existing data as context records, we will be adding new
constraints on how we create our existing data (single writer, multiple
transactions), and new failure modes.  For example, if we break up code
into multiple transactions, that could break an assumption that the
entire operation either succeeds or fails.  For another example, someone
could come along later and, not understanding why there are multiple
transactions, "optimize" the code into a single transaction.

>> As I said, timestamps are an approximation of sequence, but we have
>> genuine sequences for pretty much every table: integer ID columns.  If
>> you order the operations by database ID rather than by timestamp, then
>> you can record the last ID completed, and there is no room for
>> ambiguity.  So I think it's simpler to use database IDs rather than
>> timestamps.
> 
> There's a couple of reasons I shied away from IDs:
> 1. Timestamps are really useful to eyeball as opposed to IDs.

Agreed.

> 2. Simple integer IDs can overflow on a busy system

I'm not sure what you mean.  I'm sure we both agree that there are
maximum values that an ID can have, depending on its integer type.

However, AFAIK, none of our existing data exceeds the scope of the
default 4-byte integer type.  Only BranchRevision comes close.

But maybe you're not talking about our existing data?  If we have new
data that our standard 4-byte integer can't handle, shouldn't we use
BIGINT for its database ID?  BIGINT can represent just as many discrete
values as a timestamp, because both are eight-byte numbers.  But because
a BIGINT primary key is used sequentially, far fewer discrete values are
wasted.  Timestamps represent 4713 BCE to 5874897 CE with a
1-microsecond resolution, but it's doubtful that Launchpad will need to
represent a range of more than 100 years, and the rest of the values are
wasted.
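
A rough check of how much of an eight-byte space a 100-year window
covers.  This is only a sketch: it assumes 365-day years and counts all
2**64 bit patterns, ignoring sign, and the names are mine.

```python
# Assumption: 365-day years; all 2**64 eight-byte values are counted.
MICROSECONDS_PER_YEAR = 365 * 24 * 60 * 60 * 1000000
EIGHT_BYTE_VALUES = 2 ** 64

# Distinct microsecond timestamps in a 100-year window.
century_values = 100 * MICROSECONDS_PER_YEAR

# Share of the eight-byte space those timestamps occupy.
fraction_used = century_values / EIGHT_BYTE_VALUES
print("%.4f%% of the eight-byte space" % (fraction_used * 100))
```

Under those assumptions the window uses under 0.02% of the available
values.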

Or to look at it another way, let's assume we exhaust a BIGINT in a
year.  That's 584,942 records per microsecond:

BIGINT_SIZE = pow(256, 8)  # number of distinct eight-byte values
YEAR_MICROSECONDS = 365 * 24 * 60 * 60 * 1000000
print(BIGINT_SIZE // YEAR_MICROSECONDS)
584942

So if we're so busy that BIGINT is inadequate, we will have hundreds of
thousands of duplicate timestamps.

Were you also suggesting that when we reach the maximum value of an ID
column, we will overflow back to 1?  I'm no SQL expert, but I cannot
find any documentation that says they can.  I've had a look at
http://www.postgresql.org/docs/8.4/static/functions-sequence.html and it
doesn't suggest they can.

Even if BIGINT does wrap, it's likely to violate a unique constraint.
Duplicate timestamps will be a silent failure, unless you also add a
unique constraint to the timestamps.  In any case, this is a risk we run
with every table in the database.
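
To illustrate the difference in failure modes, here is a small sketch.
It uses an in-memory SQLite database as a stand-in for Postgres, and the
table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a unique ID column and an unconstrained timestamp.
conn.execute(
    "CREATE TABLE record (id INTEGER PRIMARY KEY, date_created TEXT)")

stamp = "2011-01-04 17:39:19.000000"
conn.execute("INSERT INTO record (id, date_created) VALUES (1, ?)", (stamp,))
# Same timestamp, new ID: accepted without complaint.
conn.execute("INSERT INTO record (id, date_created) VALUES (2, ?)", (stamp,))

# Reusing an ID violates the primary key and fails loudly.
try:
    conn.execute(
        "INSERT INTO record (id, date_created) VALUES (2, ?)", (stamp,))
    duplicate_id_rejected = False
except sqlite3.IntegrityError:
    duplicate_id_rejected = True

rows_sharing_stamp = conn.execute(
    "SELECT COUNT(*) FROM record WHERE date_created = ?", (stamp,)
).fetchone()[0]
print(duplicate_id_rejected, rows_sharing_stamp)
```

The duplicate timestamp goes in silently, while the duplicate ID raises
an error at insert time.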

>> Your idea reminds me of two things:
>> 1. DBLoopTuner
>> 2. "micro-jobs" from
>> https://dev.launchpad.net/Foundations/NewTaskSystem/Requirements
>>
>> Perhaps those could also provide inspiration?
> 
> The idea that I want to encapsulate is the concept of atomically storing a 
> restart-point, which I can't find expressed in either of these.

Sure, but that's not the only way to solve your use cases.  These both
provide design patterns that could be used for interruptible processing
of continuous data.

DBLoopTuner relies on the TunableLoop's __call__() method to store the
restart-point.  So for example,
lp.translations.scripts.verify_pofile_stats.Verifier uses self.start_id
as the restart-point.  Your idea is similar to a TunableLoop, except
that you want to store the restart point, and you want it to be
explicitly a timestamp instead of having it be an implementation detail.
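
A minimal sketch of that pattern, not Launchpad code: the class, the
driver function and the isDone()/__call__(chunk_size) shape are my
assumptions for illustration, with a plain list standing in for a table.

```python
class IdRangeLoop:
    """Hypothetical loop that keeps its restart-point in start_id."""

    def __init__(self, ids, start_id=0):
        self.ids = sorted(ids)
        self.start_id = start_id  # last ID already handled
        self.processed = []

    def isDone(self):
        # Finished when no ID beyond the restart-point remains.
        return not any(i > self.start_id for i in self.ids)

    def __call__(self, chunk_size):
        # Take the next chunk of IDs after the restart-point.
        pending = [i for i in self.ids if i > self.start_id]
        chunk = pending[:int(chunk_size)]
        for record_id in chunk:
            self.processed.append(record_id)
        if chunk:
            # Advance the restart-point to the last ID handled.
            self.start_id = chunk[-1]


def run(loop, chunk_size=2, max_chunks=None):
    """Hypothetical driver; max_chunks simulates an interruption."""
    chunks = 0
    while not loop.isDone():
        if max_chunks is not None and chunks >= max_chunks:
            break
        loop(chunk_size)
        chunks += 1


first = IdRangeLoop(range(1, 8))
run(first, max_chunks=2)          # interrupted after two chunks
resumed = IdRangeLoop(range(1, 8), start_id=first.start_id)
run(resumed)                      # picks up after the stored ID
```

Here the restart-point only lives in memory; carrying first.start_id
over to the second loop stands in for storing it somewhere durable.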

To apply "micro-jobs" to this problem, you would represent each
operation as a "micro-job".  You would directly represent which jobs had
been run and which ones had not.  The specifics depend on how we end up
implementing the new task system, but one obvious way would be to have a
status like BuildStatus for each micro-job.
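
As a sketch only, since the new task system does not exist yet: every
name below (MicroJobStatus, MicroJob, run_pending) is invented, and the
statuses are merely modelled on the idea of BuildStatus.

```python
from enum import Enum


class MicroJobStatus(Enum):
    """Hypothetical per-job status."""
    WAITING = "waiting"
    DONE = "done"
    FAILED = "failed"


class MicroJob:
    """One operation on one record, with its own status."""

    def __init__(self, record_id):
        self.record_id = record_id
        self.status = MicroJobStatus.WAITING


def run_pending(jobs, operation, limit=None):
    """Run waiting jobs in order; limit simulates an interruption."""
    ran = 0
    for job in jobs:
        if job.status is not MicroJobStatus.WAITING:
            continue  # already run, so a restart skips it
        if limit is not None and ran >= limit:
            break
        try:
            operation(job.record_id)
            job.status = MicroJobStatus.DONE
        except Exception:
            job.status = MicroJobStatus.FAILED
        ran += 1
    return ran


jobs = [MicroJob(i) for i in range(5)]
seen = []
run_pending(jobs, seen.append, limit=2)  # interrupted after two jobs
run_pending(jobs, seen.append)           # resumes with the waiting jobs
```

No separate restart-point is needed here, because the set of jobs still
marked WAITING is the record of what remains to be done.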

> Another thing we could do is to manually add some microseconds on to the 
> timestamps

I guess you mean changing the microseconds on the timestamps to ensure
they are unique?  That does not guarantee uniqueness unless you have a
single writer that does not run more than once per microsecond.
However, if we were really busy, that would be too slow.

> or to add a "serial" column,

If you add an integer column, it would make sense for it to refer to the
database ids as its values.  In this case, declaring it "serial" to
Postgres wouldn't matter, because you'd be assigning arbitrary values to
it.  Of course, once you're using database ids, you don't need the
timestamps anymore.

If it used its own sequence to refer to the rows, then we would need to
add a column to every supported table and assign a value to every row of
those tables that could be referenced, which I think would get messy.

> if they get encapsulated in a 
> different way.  I've not thought in any depth how I'd do that though.
> 
> Do you think that if we narrow down the constraints to single-writer, single 
> record per transaction, it would diminish the usefulness of this too much?  
> I'm fairly sure it would be OK in the cases I know of already.

I think that it's in conflict with your argument that "Simple integer
IDs can overflow on a busy system", because that implies hundreds of
thousands of writes per microsecond.

I think that if we need an ordered list of unique identifiers, then it's
much simpler to use integer IDs than timestamps.

Aaron
