
maria-developers team mailing list archive

Re: Implementing new "group commit" API in PBXT?

 

On Oct 5, 2010, at 3:10 PM, Kristian Nielsen wrote:

> Paul McCullagh <paul.mccullagh@xxxxxxxxxxxxx> writes:

>> The easiest way to do this would be to add a parameter to
>> xn_end_xact() that indicates that the log should not be written or
>> flushed.

> Ok, I gave it a shot, but I had some problems due to not knowing the PBXT code
> sufficiently ...

In that case, judging by your questions, you catch on quick! :)

>> In xn_end_xact(), the last parameter to the call to xt_xlog_log_data()
>> determines what should happen:
>>
>> #define XT_XLOG_NO_WRITE_NO_FLUSH	0
>> #define XT_XLOG_WRITE_AND_FLUSH		1
>> #define XT_XLOG_WRITE_AND_NO_FLUSH	2
>>
>> Without write or flush, this is a very fast operation. But the
>> transaction is still committed and ordered, it is just not durable.
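
A minimal sketch of how the suggested parameter might select between these modes inside xn_end_xact(); the boolean name fast_commit is purely illustrative and is not part of the current PBXT code:

    /* Illustrative only: select the flush behaviour for the commit record
     * based on the proposed new flag. In the fast (commit_ordered) case the
     * transaction log is neither written nor flushed, so the commit is
     * ordered but not yet durable.
     */
    int log_mode;

    if (fast_commit)
        log_mode = XT_XLOG_NO_WRITE_NO_FLUSH;   /* fast part: no log I/O */
    else
        log_mode = XT_XLOG_WRITE_AND_FLUSH;     /* normal durable commit */

    /* log_mode is then passed as the last argument to xt_xlog_log_data(). */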

> I notice that xn_end_xact() does a number of things. I am wondering if all of
> these should be in the "fast" part in commit_ordered(), or if some should be
> done in the "slow" part along with the log flush?

> In particular this, flushing the data log (is this flush to disk?):

Yes, this is a flush to disk.

This could be done in the slow part (obviously this would be ideal).

But there is the following problem that should then be fixed.

If we write the transaction log (i.e. commit the transaction), then even if we do not flush it ourselves, it may be flushed by some other thread later. This will make the commit durable (in other words, on recovery, this transaction will be rolled forward).

If we do not flush the data log, then there is a chance that such a committed transaction is incomplete, because the associated data log data has not been flushed.

The way to fix this problem is to check the extent to which both the data log and the transaction log have been flushed on recovery. Simply put, on recovery we check whether the data log part of each record is completely flushed (i.e. is within the flush zone of the data log).

If a data log record is missing, then recovery stops at that point in the transaction log.

This will have to be built into the engine. And it is easiest to do this in PBXT 1.5, which handles transaction logs and data logs identically.
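
A hypothetical sketch of that recovery-time check; none of the names below are real PBXT identifiers, they only illustrate the rule that a commit record may be rolled forward only if the data log range it references lies within the flush zone of the data log:

    /* Illustrative only: what a replayed commit record would need to expose
     * for this check (not the real PBXT log record layout).
     */
    typedef struct CommitRecRef {
        int   references_data_log;    /* does this commit reference the data log? */
        off_t data_log_end_offset;    /* end offset of the data it wrote there */
    } CommitRecRef;

    /* Returns 1 if the record can safely be rolled forward during recovery. */
    static int record_is_recoverable(CommitRecRef *rec, off_t data_log_flush_offset)
    {
        if (!rec->references_data_log)
            return 1;                             /* no data log dependency */

        /* The referenced data must lie entirely within the flush zone. */
        return rec->data_log_end_offset <= data_log_flush_offset;
    }

    /* Recovery stops rolling the transaction log forward at the first record
     * for which record_is_recoverable() returns 0.
     */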

>    if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
>            ok = FALSE;
>            status = XT_LOG_ENT_ABORT;
>    }
>
> and this, at the end concerning the "sweeper":
>
>    if (db->db_sw_faster)
>            xt_wakeup_sweeper(db);

Yes, this could be taken out of the fast part, although it is not called all that often.

>    /* Don't get too far ahead of the sweeper! */
>    if (writer) {
>        ...
>
> Can you help suggest if these should be done in the "fast" part, or in the
> "slow" part?
>
> Also, this statement definitely needs to be postponed to the "slow" part I
> guess:
>
>    thread->st_xact_data = NULL;

Actually, I don't think so. As far as PBXT is concerned, after the fast part has run, the transaction is committed. It is just not durable.

This means that anything we do in the slow part should not need an explicit reference to the transaction.
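
In other words, the fast part would record whatever the slow part needs (essentially the log position of the commit record) in the per-connection data before the transaction reference is dropped. A sketch, using the field names proposed further down; the two locals are hypothetical:

    /* Illustrative only: save what the slow part needs before the
     * transaction reference is dropped at the end of the fast part.
     */
    self->commit_fastpart_log_id     = commit_rec_log_id;      /* hypothetical locals holding */
    self->commit_fastpart_log_offset = commit_rec_log_offset;  /* the commit record position  */
    thread->st_xact_data = NULL;    /* the transaction itself is no longer needed */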


>> Then when actual commit is called, we check the current log flush
>> position against the flush position we need. If it is past our
>> position then this is a NOP.

> I think I can do this with a condition like this:
>
> if (xt_comp_log_pos(self->commit_fastpart_log_id, self->commit_fastpart_log_offset,
>                     xl_flush_log_id, xl_flush_log_offset) <= 0)

Yes!

> But I am wondering if I need to take any locks around reading xl_flush_log_id
> and xl_flush_log_offset? Or can one argue that a dirty read could be ok (as
> long as it's atomic) as the values are probably monotonic?

Basically yes. I believe I do this without lock elsewhere, and have taken care that this works.

The flush log position is always increasing. Critical is when we switch logs, e.g. from log_id=100, log_offset=80000, to log_id=101, log_offset=0.

I believe when this is done, the log_offset is first set to zero, then the log_id is incremented (should check this).

This would mean that the compare function would err on the side of flushing unnecessarily if the check falls between the two operations.
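
A tiny sketch of that (still to be verified) switch ordering, using the same xl_flush_log_id / xl_flush_log_offset names; it shows only the two stores in question, not the real log switch code:

    /* Illustrative only: reset the offset before bumping the log id. A check
     * that falls between the two stores then sees e.g. log_id=100,
     * log_offset=0, i.e. a position that is too low, and at worst triggers
     * an unnecessary flush rather than skipping a needed one.
     */
    xl_flush_log_offset = 0;    /* step 1: offset back to the start */
    xl_flush_log_id++;          /* step 2: move on to the next log */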


>> If not, then we need to call xlog_append() with no data. This will do
>> a group commit on the log.

> Is it safe to call xlog_append() with no data even if the log has been flushed
> past the current position already? (else some locking seems definitely needed).

Yes, it is safe. If there is nothing to do, xlog_append() will just return.
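
Putting these pieces together, a sketch of what the slow (durable) part might look like, using the condition suggested above; commit_fastpart_log_id / commit_fastpart_log_offset are the proposed fields (they do not exist yet), and the xlog_append() call is only indicated in a comment since its real argument list is not shown in this thread:

    /* Illustrative only. Dirty-read the current flush position (no lock).
     * The flush position only moves forward, so a stale read will normally
     * just underestimate how much has been flushed; the worst case is an
     * unnecessary group commit request (see the log switch note above).
     */
    if (xt_comp_log_pos(self->commit_fastpart_log_id, self->commit_fastpart_log_offset,
                        xl_flush_log_id, xl_flush_log_offset) <= 0) {
        /* The commit record is already within the flushed zone: nothing to do. */
    }
    else {
        /* Not flushed yet (or a stale position was read): call xlog_append()
         * here with no data, which performs a group commit on the log and
         * simply returns if there is in fact nothing left to flush.
         */
    }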

>> It was a bit difficult to explain, so please ask if anything is not
>> clear.

> Hopefully you can help with some of the above points, then I can give it
> another go with fresh eyes and maybe show you a patch.
>
> (If I get to that point, I will probably also need some advice on the proper
> error handling)...

Yup, always the tricky part!

> Anyway, from what you wrote and from what I see in the code, it seems the API
> I propose is general enough to fit well with PBXT, which is good and what I
> wanted to check (even if xn_end_xact() may need to be taken apart a bit to
> properly split into a "fast" and a "slow" part).

I would actually recommend a "lazy" approach to the implementation.

Simply add a boolean to the current commit, which indicates that a fast commit should be done.

Then we add a new "slow commit" function which does the parts not done by the fast commit.
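
A sketch of how that lazy split might plug into the new API; the function names and simplified signatures below are illustrative only, not the actual PBXT handlerton code, and the two helpers are hypothetical:

    /* Hypothetical helpers -- these do not exist in PBXT today. */
    void xn_end_xact_fast(XTThreadPtr thread);    /* xn_end_xact() with the new boolean set */
    void xn_slow_commit(XTThreadPtr thread);      /* the new "slow commit" function */

    /* Fast part: runs in commit order, must not wait on log I/O. */
    static void sketch_commit_ordered(XTThreadPtr thread)
    {
        /* Commit without writing or flushing the transaction log
         * (XT_XLOG_NO_WRITE_NO_FLUSH); remember the log position the
         * commit record needs for the slow part.
         */
        xn_end_xact_fast(thread);
    }

    /* Slow part: makes the commit durable. */
    static void sketch_commit(XTThreadPtr thread)
    {
        /* Flush the data log, flush the transaction log up to the remembered
         * position (a NOP if another thread has already flushed past it),
         * and do the remaining work such as waking the sweeper.
         */
        xn_slow_commit(thread);
    }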


--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com





