Re: Patch for MDEV 5873

Rohit Kalhans <rohit.kalhans@xxxxxxxxx> writes:

> As mentioned earlier, I have a patch for MDEV 5873, and I am attaching the
> patch bundle here. I have run tests on this patch and all of them have
> passed.

> Kindly review the patch. Also, I am not sure which mailing list is suitable
> for sending patches, so it would be great if you could point me to it.

Thanks for the patch! It's fine to send it to
maria-developers@xxxxxxxxxxxxxxxxxxx, as you did. Here is my review:

I see two main problems with this patch:

 - You moved the element::list from the upper-level hash (keyed on just
   domain_id) to the lower-level (domain_id, server_id) one. This does not
   seem correct, as the list is associated with the mysql.gtid_slave_pos
   table, which only holds data about one GTID per domain.

 - The patch relies on the master to send GTID_LIST events with the full
   binlog state - a sequence number for each (domain_id, server_id) tuple. But
   I do not see anything in your patch that makes sure those values are
   updated into the rpl_slave_state on the slave. Did I miss it? As far as I
   can see, Gtid_list_log_event::do_apply_event() only does such updates for
   events with the FLAG_IGN_GTIDS flag set; these are not generated on the
   master, but on the slave for a different purpose, related to slave event
   filters. You don't seem to have any test case that actually tests that
   this works, either...

I have some detailed comments on the patch below, mostly related to the first
of these points.

More generally, I think we need to be clearer about the exact semantics of
MASTER_GTID_WAIT(). I tried to think through the problem; here are my
thoughts, let me know what you think:

First, MASTER_GTID_WAIT() waits for a specific _position_ in the master's
binlog to be reached. It does not wait for some GTID to be applied on the
slave.

So here is what MASTER_GTID_WAIT("D-S1-M") should do:

1. Look up D in gtid_slave_state, say "D-S2-N"
2. Determine whether D-S1-M comes before or after D-S2-N in the binlog order.

Second, MariaDB GTID has two different modes, called "strict" and
"non-strict".

Strict mode is the recommended one. In strict mode, we assume (and enforce as
far as possible) that binlog order coincides with sequence number order. This
makes step (2) as simple as comparing M and N.

In non-strict mode, we avoid assuming that sequence number order is identical
to binlog order. This is necessary so as not to break backwards compatibility
(since GTID is on by default after an upgrade to 10.0). It is also useful for
dealing with mistakes that cause out-of-order binlogs between servers,
something that cannot be prevented 100% given the nature of asynchronous
replication. In non-strict mode, step (2) requires looking at the master's
binlog to decide the order of D-S1-M and D-S2-N (if S1 != S2).

Out-of-order binlogs create a lot of complexity in various corner cases. This
is seen as undesirable, especially by high-end users who employ a lot of
discipline in how they manage replication. Thus the main design goal is that
in strict mode, things should be as simple as if out-of-order sequence numbers
were impossible. The secondary goal is that in non-strict mode, we try to deal
with things as reasonably as possible, but accept that some semantics become
more complex.

The complexity in the case of MASTER_GTID_WAIT() is mainly about what happens
if the slave is not connected to the master. In this case we will have missing
or incomplete information about binlog order, so we may make incorrect or
inconsistent decisions related to MASTER_GTID_WAIT(). But this complexity is
avoided if we can assume that binlog order equals sequence number order.

So this is my suggestion:

 - In strict mode, MASTER_GTID_WAIT() only compares the sequence numbers, like
   the code before your patch.

 - In non-strict mode, MASTER_GTID_WAIT() works per (domain_id,server_id),
   like in your patch.

 - In non-strict mode, we give a warning if the slave specified in
   @@default_master_connection is not running.

This achieves the goal that in strict mode, things work simply, as one would
expect, and in non-strict mode, we are able to deal reasonably with most cases
even if we have out-of-order sequence numbers.
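
To make this concrete, here is a minimal sketch of the "has the waited-for
position been reached?" decision in the two modes. It is plain C++ with
standard containers and made-up names (WaitState, gtid_wait_reached), not the
actual server code:

  #include <cstdint>
  #include <map>
  #include <utility>

  struct WaitState
  {
    // Highest seq_no applied so far, per domain_id (strict mode).
    std::map<uint32_t, uint64_t> domain_seq_no;
    // Highest seq_no applied so far, per (domain_id, server_id) (non-strict).
    std::map<std::pair<uint32_t, uint32_t>, uint64_t> server_seq_no;
  };

  // Return true if MASTER_GTID_WAIT("domain-server-seq_no") should return
  // immediately; false means keep waiting (and, in non-strict mode, warn if
  // the slave is not running, since we may never learn the binlog order).
  static bool gtid_wait_reached(const WaitState &st, bool strict_mode,
                                uint32_t domain, uint32_t server,
                                uint64_t seq_no)
  {
    if (strict_mode)
    {
      // Strict mode: binlog order equals seq_no order, so comparing the
      // sequence numbers within the domain is enough.
      auto it= st.domain_seq_no.find(domain);
      return it != st.domain_seq_no.end() && it->second >= seq_no;
    }
    // Non-strict mode: only a GTID from the same (domain_id, server_id) can
    // tell us that the requested binlog position has been passed.
    auto it= st.server_seq_no.find(std::make_pair(domain, server));
    return it != st.server_seq_no.end() && it->second >= seq_no;
  }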

The one thing I do not like about this is the following situation: Binlog on
master contains 0-1-1,0-1-2,0-2-3. Slave has position 0-2-3. We do
MASTER_GTID_WAIT("0-1-2"). If the slave is stopped, and gtid strict mode is
off, then this will hang until the slave is started.

It is not perfect. But I do not see much that can be done about it. At least
the user got a warning, and they can use the recommended strict mode to avoid
the issue.

For the implementation of this, I think there could be one element per
(domain_id,server_id) and one extra "global" element just for the
(domain_id). When we update the slave state, we update both the element for
(domain_id,server_id) and the one for (domain_id). When we execute
MASTER_GTID_WAIT(), we wait for the (domain_id) element in gtid strict mode,
and for the (domain_id,server_id) element in non-strict mode.
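
Roughly, and only as a sketch (standard containers and hypothetical names
instead of the server's HASH and element structs), the layout and the update
on apply could look like this, with the list staying on the per-domain
element:

  #include <cstdint>
  #include <list>
  #include <unordered_map>

  struct gtid_row { uint32_t server_id; uint64_t seq_no; uint64_t sub_id; };

  // One per (domain_id, server_id); this is what non-strict
  // MASTER_GTID_WAIT() waits on.
  struct server_element
  {
    uint64_t highest_seq_no= 0;
  };

  // One per domain_id; keeps the list of mysql.gtid_slave_pos rows, plus the
  // per-domain seq_no that strict-mode MASTER_GTID_WAIT() waits on.
  struct dom_element
  {
    std::list<gtid_row> rows;
    uint64_t highest_seq_no= 0;
    std::unordered_map<uint32_t, server_element> per_server;
  };

  // Applying a GTID updates both the per-domain and the per-server state.
  static void
  update_state(std::unordered_map<uint32_t, dom_element> &state,
               uint32_t domain_id, const gtid_row &row)
  {
    dom_element &dom= state[domain_id];
    dom.rows.push_back(row);
    if (row.seq_no > dom.highest_seq_no)
      dom.highest_seq_no= row.seq_no;
    server_element &srv= dom.per_server[row.server_id];
    if (row.seq_no > srv.highest_seq_no)
      srv.highest_seq_no= row.seq_no;
  }

The signalling of waiters (COND_wait_gtid etc.) is left out; the point is only
which element gets updated and which one a given wait sleeps on.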

What do you think? Any better ideas?

Detailed comments:

> @@ -112,11 +112,18 @@
>    };
>  
>    /* Elements in the HASH that hold the state for one domain_id. */
> +  struct dom_element
> +  {
> +    uint32 domain_id;
> +    HASH hash;
> +  };
> +
> +  /* Elements in the HASH that hold the state for one server_id. */
>    struct element
>    {
>      struct list_element *list;
> -    uint32 domain_id;
> -    /* Highest seq_no seen so far in this domain. */
> +    uint32 server_id;
> +    /* Highest seq_no seen so far in this (domain,server_id). */
>      uint64 highest_seq_no;
>      /*
>        If this is non-NULL, then it is the waiter responsible for the small

I think you also need to keep the list_element *list in the dom_element.
Otherwise it subtly changes the semantics of how the mysql.gtid_slave_pos
table is managed.

> +
> +# MDEV-5873
> +# We use a gtid with the same domain and seq no as one of the trx already
> +# applied but with an impossible server id.
> +# This should return with a timeout (returns -1)
> +SELECT  master_gtid_wait("1-99-1", 2);
> +
>  # Now actually wait until the slave reaches the position
>  send SELECT master_gtid_wait(@pos);
>  
> 

I think we need a bit more testing than this.
Here are some suggestions:

Create a binlog with something like 0-1-2,0-1-3,0-2-2. Set up two waiters, one
MASTER_GTID_WAIT("0-1-2",100), one MASTER_GTID_WAIT("0-2-2",1). Do START SLAVE
UNTIL MASTER_GTID_POS="0-1-3". Check that the first wait completes and the
second times out.

With binlog 0-1-1,0-1-2,0-2-3, let the slave have gtid_slave_pos=0-2-3.
Restart the slave server (with --skip-slave-start), check that
MASTER_GTID_WAIT("0-1-2") times out in non-strict mode (with a warning), but
completes in gtid strict mode. Start the slave, check that
MASTER_GTID_WAIT("0-1-2") now completes in both strict and non-strict
mode. Stop the slave, check that MASTER_GTID_WAIT("0-1-2") still completes
immediately in both modes (because now we have the master's binlog state still
in-memory). Manually SET GLOBAL gtid_slave_pos="0-2-3", check that this clears
the whole state so that now MASTER_GTID_WAIT("0-1-2") again times out in
non-strict mode.

> @@ -112,17 +121,16 @@
>    inited= true;
>  }
>  
> -
> -void
> -rpl_slave_state::truncate_hash()
> +void truncate_server_id_hash(HASH *hash)
>  {
>    uint32 i;
>  
> -  for (i= 0; i < hash.records; ++i)
> +  for (i= 0; i < hash->records; ++i)
>    {
> -    element *e= (element *)my_hash_element(&hash, i);
> -    list_element *l= e->list;
> -    list_element *next;
> +    rpl_slave_state::element *e= (rpl_slave_state::element *)
> +                                                      my_hash_element(hash, i);
> +    rpl_slave_state::list_element *l= e->list;
> +    rpl_slave_state::list_element *next;
>      while (l)
>      {
>        next= l->next;
> @@ -131,6 +139,18 @@
>      }
>      /* The element itself is freed by the hash element free function. */
>    }
> +  my_hash_reset(hash);
> +}
> +
> +void
> +rpl_slave_state::truncate_hash()
> +{
> +  uint32 i;
> +  for (i= 0; i < hash.records; ++i)
> +  {
> +    dom_element *de= (dom_element *)my_hash_element(&hash, i);
> +    truncate_server_id_hash(&de->hash);
> +  }
>    my_hash_reset(&hash);

If you keep the list in the upper (domain) element, you don't need to change
this function.

> @@ -152,7 +172,7 @@
>    element *elem= NULL;
>    list_element *list_elem= NULL;
>  
> -  if (!(elem= get_element(domain_id)))
> +  if (!(elem= get_element(domain_id, server_id)))

This should still be get_element(domain_id); you can rename it to
get_dom_element() if you like, to match the new struct dom_element.
The list update will still work on the upper dom_element, and then to handle
the MASTER_GTID_WAIT() update you also need to update the element in the
inner hash.


> @@ -185,35 +205,67 @@
>  
>  
>  struct rpl_slave_state::element *
> -rpl_slave_state::get_element(uint32 domain_id)
> +rpl_slave_state::get_element(uint32 domain_id, uint32 server_id)
>  {
> -  struct element *elem;
> +  struct element *elem= 0;
> +  struct dom_element *d_elem= 0;
> +  bool domain_exists= false;// assuming domain does not exist already
>  
> -  elem= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0);
> +  d_elem= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0);
> +  if (d_elem)
> +  {
> +    domain_exists= true; // domain exists
> +    elem= (element *)my_hash_search(&d_elem->hash, (const uchar *)&server_id, 0);
> +  }
>    if (elem)
>      return elem;
> -
> -  if (!(elem= (element *)my_malloc(sizeof(*elem), MYF(MY_WME))))
> +  // creare new domain element.
> +  if (!domain_exists)
> +  {
> +    if (!(d_elem= (dom_element *)my_malloc(sizeof(*d_elem), MYF(MY_WME))))
> +      return NULL;
> +
> +    my_hash_init(&d_elem->hash, &my_charset_bin, 32,
> +                 offsetof(element, server_id), sizeof(uint32),
> +                 NULL, rpl_slave_state_free_element, HASH_UNIQUE);
> +    // set domain id
> +    d_elem->domain_id= domain_id;
> +    // insert this domain in the hash.
> +    if (my_hash_insert(&hash, (uchar *)d_elem))
> +    {
> +      my_free(d_elem);
> +      return NULL;
> +    }
> +  }
> +
> +  if (!(elem=(element*)my_malloc(sizeof(*elem), MYF(MY_WME))))
>      return NULL;
> +
> +  mysql_cond_init(key_COND_wait_gtid, &elem->COND_wait_gtid, 0);
>    elem->list= NULL;
> -  elem->domain_id= domain_id;
> +  elem->server_id= server_id;
>    elem->highest_seq_no= 0;
>    elem->gtid_waiter= NULL;
> -  mysql_cond_init(key_COND_wait_gtid, &elem->COND_wait_gtid, 0);
> -  if (my_hash_insert(&hash, (uchar *)elem))
> +
> +  if(my_hash_insert(&d_elem->hash, (uchar *)elem))
>    {
>      my_free(elem);
>      return NULL;
>    }
> +
>    return elem;
>  }

I think we still need a get_element() that works only on the domain; it is
used in a few places (check_duplicate_gtid(), release_domain_owner(), and
record_gtid()). So maybe add another function that looks up in the sub
(domain_id, server_id) hash, or a separate function that first calls
get_element() and then does the second lookup.
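
Something like the following, again only a sketch with standard containers
and made-up names (get_dom_element() / get_server_element() do not exist in
the server):

  #include <cstdint>
  #include <unordered_map>

  struct server_element { uint64_t highest_seq_no= 0; };
  struct dom_element
  {
    std::unordered_map<uint32_t, server_element> per_server;
  };
  using slave_state= std::unordered_map<uint32_t, dom_element>;

  // Domain-only lookup, as needed by check_duplicate_gtid(),
  // release_domain_owner() and record_gtid(); creates the element on first
  // use, like the existing get_element(domain_id).
  static dom_element *get_dom_element(slave_state &state, uint32_t domain_id)
  {
    return &state[domain_id];
  }

  // Second-level lookup into the (domain_id, server_id) sub-hash, built on
  // top of the domain-only lookup.
  static server_element *get_server_element(slave_state &state,
                                            uint32_t domain_id,
                                            uint32_t server_id)
  {
    dom_element *dom= get_dom_element(state, domain_id);
    return &dom->per_server[server_id];
  }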

>  int
> -rpl_slave_state::put_back_list(uint32 domain_id, list_element *list)
> +rpl_slave_state::put_back_list(uint32 domain_id, uint32 server_id,
> +                               list_element *list)
>  {
> +  dom_element* de;
>    element *e;
> -  if (!(e= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0)))
> +  if (!(de= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0)))
> +    return 1;
> +  if (!(e=(element *)my_hash_search(&de->hash,(const uchar*)&server_id, 0)))
>      return 1;
>    while (list)
>    {

Again, the list should be on the upper layer, so that this is not necessary.


> @@ -556,7 +608,7 @@
>  rpl_slave_state::iterate(int (*cb)(rpl_gtid *, void *), void *data,
>                           rpl_gtid *extra_gtids, uint32 num_extra)
>  {
> -  uint32 i;
> +  uint32 i, j;
>    HASH gtid_hash;
>    uchar *rec;
>    rpl_gtid *gtid;
> @@ -573,25 +625,34 @@
>  
>    for (i= 0; i < hash.records; ++i)
>    {
> -    uint64 best_sub_id;
> +    uint64 best_sub_id= 0;
>      rpl_gtid best_gtid;
> -    element *e= (element *)my_hash_element(&hash, i);
> -    list_element *l= e->list;
> -
> -    if (!l)
> -      continue;                                 /* Nothing here */
> -
> -    best_gtid.domain_id= e->domain_id;
> -    best_gtid.server_id= l->server_id;
> -    best_gtid.seq_no= l->seq_no;
> -    best_sub_id= l->sub_id;
> -    while ((l= l->next))
> +    bool first= true;
> +    dom_element *de= (dom_element *)my_hash_element(&hash, i);
> +    for ( j= 0; j < de->hash.records; ++j)
>      {
> -      if (l->sub_id > best_sub_id)
> +      element* e= (element *)my_hash_element(&de->hash, j);
> +      list_element *l= e->list;
> +
> +      if (!l)
> +        continue;                                 /* Nothing here */
> +      if (first) // executed once per domain.
>        {
>          best_sub_id= l->sub_id;
> +        best_gtid.domain_id= de->domain_id;
>          best_gtid.server_id= l->server_id;
>          best_gtid.seq_no= l->seq_no;
> +        first= false;
> +      }
> +      while(l)
> +      {
> +        if (l->sub_id > best_sub_id)
> +        {
> +          best_sub_id= l->sub_id;
> +          best_gtid.server_id= l->server_id;
> +          best_gtid.seq_no= l->seq_no;
> +        }
> +        l = l->next;
>        }
>      }
>  

I don't think rpl_slave_state::iterate() should need to change either, if you
put the list on the upper (domain) hash elements.

> @@ -683,30 +744,48 @@
>  bool
>  rpl_slave_state::domain_to_gtid(uint32 domain_id, rpl_gtid *out_gtid)
>  {
> -  element *elem;
> +  dom_element *d_elem;
> +  element* elem;
>    list_element *list;
>    uint64 best_sub_id;
> +  uint32 i;
> +  bool first= true;
>  
>    mysql_mutex_lock(&LOCK_slave_state);
> -  elem= (element *)my_hash_search(&hash, (const uchar *)&domain_id, 0);
> -  if (!elem || !(list= elem->list))
> +  d_elem= (dom_element *)my_hash_search(&hash, (const uchar *)&domain_id, 0);
> +  if (!d_elem)
>    {
>      mysql_mutex_unlock(&LOCK_slave_state);
>      return false;
>    }
> -
> -  out_gtid->domain_id= domain_id;
> -  out_gtid->server_id= list->server_id;
> -  out_gtid->seq_no= list->seq_no;
> -  best_sub_id= list->sub_id;
> -
> -  while ((list= list->next))
> +  for (i= 0; i < d_elem->hash.records; ++i)
>    {
> -    if (best_sub_id > list->sub_id)
> -      continue;
> -    best_sub_id= list->sub_id;
> -    out_gtid->server_id= list->server_id;
> -    out_gtid->seq_no= list->seq_no;
> +    elem= (element*)my_hash_element(&d_elem->hash, i);
> +    if (!elem || !(list= elem->list))
> +    {
> +      mysql_mutex_unlock(&LOCK_slave_state);
> +      return false;
> +    }
> +    if (first)
> +    {
> +      out_gtid->domain_id= domain_id;
> +      out_gtid->server_id= list->server_id;
> +      out_gtid->seq_no= list->seq_no;
> +      best_sub_id= list->sub_id;
> +      first= false;
> +    }
> +    while (list)
> +    {
> +      if (best_sub_id > list->sub_id)
> +      {
> +        list= list->next;
> +        continue;
> +      }
> +      best_sub_id= list->sub_id;
> +      out_gtid->server_id= list->server_id;
> +      out_gtid->seq_no= list->seq_no;
> +      list= list->next;
> +    }
>    }
>  
>    mysql_mutex_unlock(&LOCK_slave_state);

rpl_slave_state::domain_to_gtid() also should not need any changes.

> @@ -836,18 +915,23 @@
>  bool
>  rpl_slave_state::is_empty()
>  {
> -  uint32 i;
> +  uint32 i, j;
>    bool result= true;
>  
>    mysql_mutex_lock(&LOCK_slave_state);
>    for (i= 0; i < hash.records; ++i)
>    {
> -    element *e= (element *)my_hash_element(&hash, i);
> -    if (e->list)
> +    dom_element *de= (dom_element *)my_hash_element(&hash, i);
> +    for (j= 0; j < de->hash.records; ++j)
>      {
> -      result= false;
> -      break;
> +      element *e= (element *)my_hash_element(&de->hash, j);
> +      if (e->list)
> +      {
> +        result= false;
> +        break;
> +      }
>      }
> +    if(!result) break;

If you keep the list in dom_element, then this change becomes unnecessary.

 - Kristian.

