maria-developers team mailing list archive

Thread
Date
Re: Rev 2740: Group commit for maria storage engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/

To: <sanja@xxxxxxxxxxxx>
From: Michael Widenius <monty@xxxxxxxxxxxx>
Date: Tue, 9 Feb 2010 15:45:29 +0200
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20100209083302.60EF6407B6DE@SBMBP.local>
Reply-to: monty@xxxxxxxxxxxx
Hi!

>>>>> "sanja" == sanja  <sanja@xxxxxxxxxxxx> writes:

sanja> At file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
sanja> ------------------------------------------------------------
sanja> revno: 2740
sanja> revision-id: sanja@xxxxxxxxxxxx-20100209083259-ekki5zw4hbaeqpwh
sanja> parent: knielsen@xxxxxxxxxxxxxxx-20100201190519-b9uktnn90rwwiile
sanja> committer: sanja@xxxxxxxxxxxx
sanja> branch nick: work-maria-5.2-groupcommit
sanja> timestamp: Tue 2010-02-09 10:32:59 +0200
sanja> message:
sanja>   Group commit for maria storage engine.

> === modified file 'storage/maria/ha_maria.cc'
> --- a/storage/maria/ha_maria.cc	2009-12-03 11:34:11 +0000
> +++ b/storage/maria/ha_maria.cc	2010-02-09 08:32:59 +0000

<cut>

> +static MYSQL_SYSVAR_ULONG(group_commit_interval, maria_group_commit_interval,
> +       PLUGIN_VAR_RQCMDARG,
> +       "Interval between commite in microseconds (1/1000000c)."
> +       " 0 stands for no waiting"
> +       "for other threads to come and do a commit in \"hard\" mode and no"
> +       " sync()/commit at all in \"soft\" mode.  Option has only an effect"
> +       "if maria_group_commit is used",

Fix so that you have a space last for the continous lines.
(The above has 2 text errors:  watingfor and effectif)

> === modified file 'storage/maria/ma_loghandler.c'
> --- a/storage/maria/ma_loghandler.c	2010-01-06 21:27:53 +0000
> +++ b/storage/maria/ma_loghandler.c	2010-02-09 08:32:59 +0000
> @@ -2997,7 +3068,25 @@
>            }
>            DBUG_ASSERT(LSN_FILE_NO(addr) ==  LSN_FILE_NO(curr_buffer->offset));
>            from= curr_buffer->buffer + (addr - curr_buffer->offset);
> -          memcpy(buffer, from, TRANSLOG_PAGE_SIZE);
> +          if (skipped_data > (addr - curr_buffer->offset))
> +          {
> +            /*
> +              We read page part of which is not present in buffer,
> +              so we should read absent part from file (page cache actually)
> +            */
> +            file= get_logfile_by_number(file_no);
> +            DBUG_ASSERT(file != NULL);
> +            buffer= pagecache_read(log_descriptor.pagecache, &file->handler,
> +                                   LSN_OFFSET(addr) / TRANSLOG_PAGE_SIZE,
> +                                   3, buffer,
> +                                   PAGECACHE_PLAIN_PAGE,
> +                                   PAGECACHE_LOCK_LEFT_UNLOCKED,
> +                                   NULL);
> +          }
> +          else
> +            skipped_data= 0;  /* Read after skipped in buffer data */
> +          memcpy(buffer + skipped_data, from + skipped_data,
> +                 TRANSLOG_PAGE_SIZE - skipped_data);

Above you don't handle the unlikely (but possible) error that buffer == 0

<cut>

> +  pthread_mutex_lock(&log_descriptor.log_flush_lock);
> +  DBUG_PRINT("info", ("Everything is flushed up to (%lu,0x%lx)",
> +                      LSN_IN_PARTS(log_descriptor.flushed)));
> +  if (cmp_translog_addr(log_descriptor.flushed, lsn) >= 0)
> +
> +

Remove the two empty lines above.

cut>

> +  for (;;)
> +  {
> +    /* Following function flushes buffers and makes translog_unlock() */
> +    translog_flush_buffers(&lsn, &sent_to_disk, &flush_horizon);
> +
> +    if (!hgroup_commit_at_start)
> +      break;  /* flush pass is ended */
> +
> +retest:
> +    if (flush_interval != 0 &&
> +        (my_micro_time() - flush_start) >= flush_interval)
> +      break;  /* flush pass is ended */
> +
> +    pthread_mutex_lock(&log_descriptor.log_flush_lock);
> +    if (log_descriptor.next_pass_max_lsn != LSN_IMPOSSIBLE)
> +    {
> +      /* take next goal */
> +      lsn= log_descriptor.next_pass_max_lsn;
> +      log_descriptor.next_pass_max_lsn= LSN_IMPOSSIBLE;
> +      /* prevent other thread from continue */
> +      log_descriptor.max_lsn_requester= pthread_self();
> +      DBUG_PRINT("info", ("flush took next goal: (%lu,0x%lx)",
> +                          LSN_IN_PARTS(lsn)));
> +    }
> +    else
> +    {
> +      if (flush_interval == 0 ||
> +          (time_spent= (my_micro_time() - flush_start)) >= flush_interval)
>        {
> +        pthread_mutex_unlock(&log_descriptor.log_flush_lock);
> +        break;
>        }
> +      DBUG_PRINT("info", ("flush waits: %llu  interval: %llu  spent: %llu",
> +                          flush_interval - time_spent,
> +                          flush_interval, time_spent));
> +      /* wait time or next goal */
> +      set_timespec_nsec(abstime, flush_interval - time_spent);
> +      pthread_cond_timedwait(&log_descriptor.new_goal_cond,
> +                             &log_descriptor.log_flush_lock,
> +                             &abstime);
> +      pthread_mutex_unlock(&log_descriptor.log_flush_lock);
> +      DBUG_PRINT("info", ("retest conditions"));
> +      goto retest;
> +    }
> +    pthread_mutex_unlock(&log_descriptor.log_flush_lock);
> +
> +    /* next flush pass */
> +    DBUG_PRINT("info", ("next flush pass"));
> +    translog_lock();
> +  }

What about the idea of setting flush_start to the last call of
my_micro_time() ?

This would mean that on a busy system we would wait, but on a non busy
system (with little syncs) we would not do any waits at all.

And we would win some calls to my_micro_time() too (which may not be a
very cheap call)

The code change would be very simple:

- Remove old call of testing flush_start.
- In the above calls, replace my_micro_time() with 'cached_micro_time=
  my_micro_time()'
  and after end of loop, set flush_start= cached_micro_time

Hm.... maybe this wouldn't work as we are calling sync after this, which
is causing some delays.

> +
> +  /*
> +    sync() files from previous flush till current one
> +  */
> +  if (!soft_sync || hgroup_commit_at_start)
> +  {
> +    if ((rc=
> +         translog_sync_files(LSN_FILE_NO(log_descriptor.flushed),
> +                             LSN_FILE_NO(lsn),
> +                             sync_log_dir >= TRANSLOG_SYNC_DIR_ALWAYS &&
> +                             (LSN_FILE_NO(log_descriptor.
> +                                          previous_flush_horizon) !=
> +                              LSN_FILE_NO(flush_horizon) ||
> +                              (LSN_OFFSET(log_descriptor.
> +                                          previous_flush_horizon) /
> +                               TRANSLOG_PAGE_SIZE) !=
> +                              (LSN_OFFSET(flush_horizon) /
> +                               TRANSLOG_PAGE_SIZE)))))
> +    {
> +      sent_to_disk= LSN_IMPOSSIBLE;
> +      pthread_mutex_lock(&log_descriptor.log_flush_lock);
> +      goto out;
> +    }
> +    /* keep values for soft sync() and forced sync() actual */
> +    {
> +      uint32 fileno= LSN_FILE_NO(lsn);
> +      my_atomic_rwlock_wrlock(&soft_sync_rwl);
> +      my_atomic_store32(&soft_sync_min, fileno);
> +      my_atomic_store32(&soft_sync_max, fileno);
> +      my_atomic_rwlock_wrunlock(&soft_sync_rwl);

Don't understand why my_atomic_rwlock_wrlock is enough protection here.
Why use my_atomic_store32 ?

> +    }
> +  }
> +  else
> +  {
> +    my_atomic_rwlock_wrlock(&soft_sync_rwl);
> +    my_atomic_store32(&soft_sync_max, LSN_FILE_NO(lsn));
> +    my_atomic_rwlock_wrunlock(&soft_sync_rwl);

Do we really need a lock to store one variable?
what is the problem with just doing:

soft_sync_max= LSN_FILE_NO(lsn);

As after all, only one thread can be here at once.

> +  }
> @@ -8113,6 +8427,8 @@
>  my_bool translog_purge(TRANSLOG_ADDRESS low)
>  {
>    uint32 last_need_file= LSN_FILE_NO(low);
> +  uint32 min_unsync;
> +  int soft;
>    TRANSLOG_ADDRESS horizon= translog_get_horizon();
>    int rc= 0;
>    DBUG_ENTER("translog_purge");
> @@ -8120,12 +8436,23 @@
>    DBUG_ASSERT(translog_status == TRANSLOG_OK ||
>                translog_status == TRANSLOG_READONLY);
>  
> +  soft= soft_sync;
> +  DBUG_PRINT("info", ("min_unsync: %lu", (ulong) min_unsync));
> +  if (soft && min_unsync < last_need_file)
> +  {
> +    last_need_file= min_unsync;
> +    DBUG_PRINT("info", ("last_need_file set to %lu", (ulong)last_need_file));
> +  }

The above must be a bug as you are never giving min_unsync a value.

>  
> +void translog_sync()
> +{
> +  uint32 max= get_current_logfile()->number;
> +  uint32 min;
> +  DBUG_ENTER("ma_translog_sync");
> +
> +  my_atomic_rwlock_rdlock(&soft_sync_rwl);
> +  min= my_atomic_load32(&soft_sync_min);
> +  my_atomic_rwlock_rdunlock(&soft_sync_rwl);

Don't understand why you need to do atomic operations above:
- You are only reading on value
- get_current_logfile() is read without a mutex, so the values are
  already independent.

> +/**
> +  @brief syncing service thread
> +*/
> +
> +static pthread_handler_t
> +ma_soft_sync_background( void *arg __attribute__((unused)))
> +{
> +
> +  my_thread_init();
> +  {
> +    DBUG_ENTER("ma_soft_sync_background");
> +    for(;;)
> +    {
> +      ulonglong prev_loop= my_micro_time();
> +      ulonglong time, sleep;
> +      uint32 min, max;
> +      my_atomic_rwlock_rdlock(&soft_sync_rwl);
> +      min= my_atomic_load32(&soft_sync_min);
> +      max= my_atomic_load32(&soft_sync_max);
> +      my_atomic_store32(&soft_sync_min, max);
> +      my_atomic_rwlock_rdunlock(&soft_sync_rwl);

Don't still understand what ensures that the above operations are safe
from other threads as my_atomic_rwlock_rdlock() and
my_atomic_rwlock_rdunlock() may be dummy operations.

> +
> +      sleep= group_commit_wait;
> +      translog_sync_files(min, max, FALSE);

It would be good to have a test above that if there is nothing to
sync, we don't call translog_sync_files().

Regards,
Monty
Follow ups

Re: Rev 2740: Group commit for maria storage engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
From: Oleksandr Byelkin, 2010-02-10
References

Rev 2740: Group commit for maria storage engine. in file:///Users/bell/maria/bzr/work-maria-5.2-groupcommit/
From: sanja, 2010-02-09