maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #03639
Re: Implementing new "group commit" API in PBXT?
On Oct 5, 2010, at 3:10 PM, Kristian Nielsen wrote:
Paul McCullagh <paul.mccullagh@xxxxxxxxxxxxx> writes:
The easiest way to do this would be to add a parameter to
xn_end_xact() that indicates that the log should not be written or
flushed.
Ok, I gave it a shot, but I had some problems due to not knowing the
PBXT code
sufficiently ...
In that case, judging by your questions, you catch on quick! :)
In xn_end_xact(), the last parameter to the call to
xt_xlog_log_data()
determines what should happen:
#define XT_XLOG_NO_WRITE_NO_FLUSH 0
#define XT_XLOG_WRITE_AND_FLUSH 1
#define XT_XLOG_WRITE_AND_NO_FLUSH 2
Without write or flush, this is a very fast operation. But the
transaction is still committed and ordered, it is just not durable.
I notice that xs_end_xact() does a number of things. I am wondering
if all of
these should be in the "fast" part in commit_ordered(), or if some
should be
done in the "slow" part along with the log flush?
In particular this, flushing the data log (is this flush to disk?):
Yes, this is a flush to disk.
This could be done in the slow part (obviously this would be ideal).
But there is the following problem that should then be fixed.
If we write the transaction log (i.e. commit the transaction), even if
we do not flush the transaction log. It may be flushed by some other
thread later. This will make the commit durable (in other words, on
recovery, this transaction will be rolled forward).
If we do not flush the data log, then there is a chance that such a
commit transaction is incomplete, because the associated data log data
has not been committed.
The way to fix this problem is to check the extend of flushing of both
the data and the transaction log on recovery. Simply put, on recover
we check if the data log part of each record is completely flushed (is
within the flush zone of the data log).
If a data log record is missing, then recovery stops at that point in
the transaction log.
This will have to be built into the engine. And, it is easiest to do
this in PBXT 1.5 which handle transaction logs and data logs
identically.
if (!thread->st_dlog_buf.dlb_flush_log(TRUE, thread)) {
ok = FALSE;
status = XT_LOG_ENT_ABORT;
}
and this, at the end concerning the "sweeper":
if (db->db_sw_faster)
xt_wakeup_sweeper(db);
Yes, this could be taken out of the fast part, although it is not
called all that often.
/* Don't get too far ahead of the sweeper! */
if (writer) {
...
Can you help suggest if these should be done in the "fast" part, or
in the
"slow" part?
Also, this statement definitely needs to be postponed to the "slow"
part I
guess:
thread->st_xact_data = NULL;
Actually, I don't think so. As far as PBXT is concerned, after the
fast part has run, the transaction is committed. It is just not durable.
This means that anything we do in the slow part should not need an
explicit reference to the transaction.
Then when actual commit is called, we check the current log flush
position against the flush position we need. If it is passed our
position then this is a NOP.
I think I can do this with a condition like this:
if (xt_comp_log_pos(self->commit_fastpart_log_id, self-
>commit_fastpart_log_offset, xl_flush_log_id, xl_flush_log_offset)
<= 0)
Yes!
But I am wondering if I need to take any locks around reading
xl_flush_log_id
and xl_flush_log_offset? Or can one argue that a dirty read could be
ok (as
long as it's atomic) as the values are probably monotonic?
Basically yes. I believe I do this without lock elsewhere, and have
taken care that this works.
The flush log position is always increasing. Critical is when we
switch logs, e.g. from log_id=100, log_offset=80000, to log_id=101,
log_offset=0.
I believe when this is done, the log_offset is first set to zero, then
the log_id is incremented (should check this).
This would mean that the comparing function would err on the side of
flushing unnecessarily if the check comes between the to operations.
If not, then we need to call xlog_append() with no data. This will do
a group commit on the log.
Is it safe to call xlog_append() with no data even if the log has
been flushed
past the current position already? (else some locking seems
definitely needed).
Yes, it is safe. If there is nothing to do, xlog_append() will just
return.
I was a bit difficult to explain, so please ask if anything is not
clear.
Hopefully you can help with some of the above points, then I can
give it
another go with fresh eyes and maybe show you a patch.
(If I get to that point, I will probably also need some advice on
the proper
error handling)...
Yup, always the tricky part!
Anyway, from what you wrote and from what I see in the code, it
seems the API
I propose is general enough to fit well with PBXT, which is good and
what I
wanted to check (Even if xn_end_xact() may need to be taken apart a
bit to
properly split into a "fast" and a "slow" part).
I would actually recommend a "lazy" approach to the implementation.
Simply add a boolean to the current commit, which indicates a fast
commit should be done.
Then we add a new "slow commit" function which does the parts not done
by the slow commit.
--
Paul McCullagh
PrimeBase Technologies
www.primebase.org
www.blobstreaming.org
pbxt.blogspot.com
Follow ups
References