← Back to team overview

maria-developers team mailing list archive

Review of base64.diff and german2.diff

 

Hi!

Here is the review for the code that we should put into 10.0

First the base64:

> === modified file 'mysql-test/t/func_str.test'
> --- mysql-test/t/func_str.test	2013-05-07 11:05:09 +0000
> +++ mysql-test/t/func_str.test	2013-08-28 13:14:24 +0000
> @@ -1555,3 +1555,118 @@ drop table t1,t2;
>  --echo # End of 5.5 tests
>  --echo #
>  
> +
> +--echo #
> +--echo # Start of 5.6 tests
> +--echo #

Shouldn't this be start of 10.0 tests ?
(I know that this code is from MySQL 5.6, but still for us this is 10.0...)

> === modified file 'mysys/base64.c'
> --- mysys/base64.c	2011-06-30 15:46:53 +0000
> +++ mysys/base64.c	2013-03-09 06:22:59 +0000
> @@ -1,5 +1,4 @@
> -/* Copyright (c) 2003-2008 MySQL AB, 2009 Sun Microsystems, Inc.
> -   Use is subject to license terms.
> +/* Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.

Removed 'all rights reserved'. You can't have that for GPL code.

If there is any new code from us, please add a copyright message for the
MariaDB foundation too!

>  
>     This program is free software; you can redistribute it and/or modify
>     it under the terms of the GNU General Public License as published by
> @@ -25,6 +24,28 @@ static char base64_table[] = "ABCDEFGHIJ
>                               "abcdefghijklmnopqrstuvwxyz"
>                               "0123456789+/";
>  
> +/**
> + * Maximum length base64_needed_encoded_length()
> + * can handle without overflow.
> + */
> +int
> +base64_encode_max_arg_length()
> +{
> +#if (SIZEOF_INT == 8)
> +  /*
> +    6827690988321067803 ->   9223372036854775805
> +    6827690988321067804 ->  -9223372036854775807
> +  */

Please describe from where the following number comes from
(so that anyone can recalculate and check the number if needed):

> +  return 0x5EC0D4C77B03531BLL
> +#else
> +  /*
> +    1589695686 ->  2147483646
> +    1589695687 -> -2147483645
> +  */
> +  return 0x5EC0D4C6;
> +#endif
> +}
> +
>  
>  int
>  base64_needed_encoded_length(int length_of_data)
> @@ -39,6 +60,21 @@ base64_needed_encoded_length(int length_
>  }
>  
>  
> +/**
> + * Maximum length base64_needed_decoded_length()
> + * can handle without overflow.
> + */
> +int
> +base64_decode_max_arg_length()
> +{

The following also need a comment.

> +#if (SIZEOF_INT == 8)
> +  return 0x2AAAAAAAAAAAAAAALL;
> +#else
> +  return 0x2AAAAAAA;
> +#endif
> +}
> +
> +

<cut>

> +/**
> + * Get the next character from a base64 encoded stream.
> + * Skip spaces, then scan the next base64 character or a pad character
> + * and collect bits into decoder->c.
> + *
> + * @param  decoder  Pointer to MY_BASE64_DECODER
> + * @return
> + *  FALSE on success (a valid base64 encoding character found)
> + *  TRUE  on error (unexpected character or unexpected end-of-input found)
> + */

Add an empty line here.

It would also be good to have a good description of the base64 format
in this file so that one can understand what the functions are
supposed to do (or at least a link to the specification).

Some thing I figured out by asking bar on IRC:

- We never generate spaces
- When decoding, we allow spaces anywhere.
- The end of an 64 base  decoded string is always padded with '='
  so that the final length is dividable with 4.

> +static inline my_bool
> +my_base64_decoder_getch(MY_BASE64_DECODER *decoder)

Remove inline from the above; It's likely to make the code slower, not
faster as this is a big function with lot of if's.

>  {
> -  char b[3];
> -  size_t i= 0;
> -  char *dst_base= (char *)dst;
> -  char const *src= src_base;
> -  char *d= dst_base;
> -  size_t j;
> +  if (my_base64_decoder_skip_spaces(decoder))
> +    return TRUE; /* End-of-input */
>  
> -  while (i < len)
> +  if (!my_base64_add(decoder)) /* Valid base64 character found */
>    {
> -    unsigned c= 0;
> -    size_t mark= 0;
> +    if (decoder->mark)
> +    {
> +      /* If we have scanned '=' already, then only '=' is valid */
> +      DBUG_ASSERT(decoder->state == 3);
> +      decoder->error= 1;
> +      decoder->src--;
> +      return TRUE; /* expected '=', but encoding character found */
> +    }
> +    decoder->state++;
> +    return FALSE;
> +  }

If you want to have the funtion inline, then move the following to
another not inline function;  We don't need to generate code for this
for every function call:

> +
> +  /* Process error */
> +  switch (decoder->state)
> +  {
> +  case 0:
> +  case 1:
> +    decoder->src--;
> +    return TRUE; /* base64 character expected */
> +    break;
> +
> +  case 2:
> +  case 3:
> +    if (decoder->src[-1] == '=')
> +    {
> +      decoder->error= 0; /* Not an error - it's a pad character */
> +      decoder->mark++;
> +    }
> +    else
> +    {
> +      decoder->src--;
> +      return TRUE; /* base64 character or '=' expected */
> +    }
> +    break;
> +  default:
> +    DBUG_ASSERT(0);
> +    return TRUE; /* Wrong state, should not happen */
> +  }

<cut>

Add to the comment for base64_decode that the 'dst' buffer has to be
at least 2 character longer than what is needed to decode src_base.

> +int
> +base64_decode(const char *src_base, size_t len,
> +              void *dst, const char **end_ptr, int flags)
> +{
> +    if (my_base64_decoder_getch(&decoder) ||
> +        my_base64_decoder_getch(&decoder) ||
> +        my_base64_decoder_getch(&decoder) ||
> +        my_base64_decoder_getch(&decoder))
> +      break;

Generating the error handling with a switch for every of the
above is wasteful. See previous comment.

> +  /* Return error if there are more non-space characters */
> +  decoder.state= 0;
> +  if (!my_base64_decoder_skip_spaces(&decoder))
> +    decoder.error= 1;
> +
>    if (end_ptr != NULL)

Note that base64_needed_decoded_length() used ceil() when it's not needed.

<cut>

> === modified file 'sql/item_strfunc.cc'
> @@ -451,6 +452,101 @@ void Item_func_aes_decrypt::fix_length_a
>     set_persist_maybe_null(1);
>  }
>  
> +
> +void Item_func_to_base64::fix_length_and_dec()
> +{
> +  maybe_null= args[0]->maybe_null;
> +  collation.set(default_charset(), DERIVATION_COERCIBLE, MY_REPERTOIRE_ASCII);

Maybe better to cast both arguments to (ulonglong) as the rest of the code
is depending on this.

Also, why is base64_encode_max_arg_length() int instead of uint or
even better size_t?

> +  if (args[0]->max_length > (uint) base64_encode_max_arg_length())
> +  {
> +    maybe_null= 1;
> +    fix_char_length_ulonglong((ulonglong) base64_encode_max_arg_length());
> +  }
> +  else
> +  {
> +    int length= base64_needed_encoded_length((int) args[0]->max_length);
> +    DBUG_ASSERT(length > 0);

Don't think assert is needed as the function gurantees it already.

> +    fix_char_length_ulonglong((ulonglong) length - 1);
> +  }
> +}

<cut>

> +String *Item_func_from_base64::val_str(String *str)
> +{
> +  String *res= args[0]->val_str_ascii(str);
> +  bool too_long= false;
> +  int length;
> +  const char *end_ptr;
> +
> +  if (!res ||
> +      res->length() > (uint) base64_decode_max_arg_length() ||
> +      (too_long=
> +       ((uint) (length= base64_needed_decoded_length((int) res->length())) >
> +        current_thd->variables.max_allowed_packet)) ||
> +      tmp_value.alloc((uint) length) ||
> +      (length= base64_decode(res->ptr(), (int) res->length(),
> +                             (char *) tmp_value.ptr(), &end_ptr, 0)) < 0 ||
> +      end_ptr < res->ptr() + res->length())
> +  {

Shouldn't we get a readable error or warning if the string contained wrong
characters?
Now it looks like we will just return null.

Something like 'Malformed base64 string. Error at position %d" would
be nice.


> +    null_value= 1; // NULL input, too long input, OOM, or badly formed input
> +    if (too_long)
> +    {
> +      push_warning_printf(current_thd, Sql_condition::WARN_LEVEL_WARN,
> +                          ER_WARN_ALLOWED_PACKET_OVERFLOWED,
> +                          ER(ER_WARN_ALLOWED_PACKET_OVERFLOWED), func_name(),
> +                          current_thd->variables.max_allowed_packet);
> +    }
> +    return 0;
> +  }
> +  tmp_value.length((uint) length);
> +  null_value= 0;
> +  return &tmp_value;
> +}

<cut>

Review of german2.diff

=== modified file 'include/my_global.h'
--- include/my_global.h	2013-07-21 14:39:19 +0000
+++ include/my_global.h	2013-08-29 08:27:09 +0000
@@ -1218,4 +1218,11 @@ static inline double rint(double x)
 #define HAVE_EXTERNAL_CLIENT
 #endif /* EMBEDDED_LIBRARY */
 
+
+enum loglevel {
+   ERROR_LEVEL=       0,
+   WARNING_LEVEL=     1,
+   INFORMATION_LEVEL= 2
+};
+
 #endif /* my_global_h */

Not sure why the above, which is a my_sys construct, should be in
my_global.h.  Please add at least a comment about this.

=== modified file 'mysys/charset.c'
--- mysys/charset.c	2012-08-14 14:23:34 +0000
+++ mysys/charset.c	2013-08-29 12:38:44 +0000
@@ -66,6 +66,9 @@ static my_bool init_state_maps(struct ch
   if (!(cs->ident_map= ident_map= (uchar*) my_once_alloc(256, MYF(MY_WME))))
     return 1;
 
+  state_map= (uchar *) cs->state_map;
+  ident_map= (uchar *) cs->ident_map;

Why is the above needed?
They are already set one line above.
(In addition cs->state_map and cs->ident_map are already uchar *)

--- strings/ctype-simple.c	2013-07-21 14:39:19 +0000
+++ strings/ctype-simple.c	2013-08-29 09:05:07 +0000
@@ -1163,12 +1163,12 @@ static int pcmp(const void * f, const vo
   return res;
 }
 
-static my_bool create_fromuni(struct charset_info_st *cs,
-                              void *(*alloc)(size_t))
+static my_bool
+create_fromuni(struct charset_info_st *cs,
+               MY_CHARSET_LOADER *loader)
 {
   uni_idx	idx[PLANE_NUM];
   int		i,n;
-  struct my_uni_idx_st *tab_from_uni;
   
   /*
     Check that Unicode map is loaded.
@@ -1209,18 +1209,18 @@ static my_bool create_fromuni(struct cha
   for (i=0; i < PLANE_NUM; i++)
   {
     int ch,numchars;
-    uchar *tab;
     
     /* Skip empty plane */
     if (!idx[i].nchars)
       break;
     
     numchars=idx[i].uidx.to-idx[i].uidx.from+1;
-    if (!(idx[i].uidx.tab= tab= (uchar*)
-          alloc(numchars * sizeof(*idx[i].uidx.tab))))
+    if (!(idx[i].uidx.tab= (uchar *)
+                           (loader->once_alloc) (numchars *
+                                                 sizeof(*idx[i].uidx.tab))))
       return TRUE;
     
-    bzero(tab,numchars*sizeof(*tab));
+    bzero((char*) idx[i].uidx.tab, numchars * sizeof(*idx[i].uidx.tab));
     
     for (ch=1; ch < PLANE_SIZE; ch++)
     {
@@ -1228,32 +1228,32 @@ static my_bool create_fromuni(struct cha
       if (wc >= idx[i].uidx.from && wc <= idx[i].uidx.to && wc)
       {
         int ofs= wc - idx[i].uidx.from;
-        tab[ofs]= ch;
+        ((char *) idx[i].uidx.tab)[ofs]= ch;
       }
     }
   }
   
Why remove tab to be a shortcut for idx[i].uidx.from ?
Shouldn't it be faster and shorter to use tab than using idx[i].uidx.from?
(At least in this place)



-static const uint16 page0FCdata[]= { /* FC00 (3 weights per char) */
+static uint16 page0FCdata[]= { /* FC00 (3 weights per char) */
 0x134F,0x135E,0x0000, 0x134F,0x1364,0x0000, 0x134F,0x13B0,0x0000,
 0x134F,0x13C7,0x0000, 0x134F,0x13C8,0x0000, 0x1352,0x135E,0x0000,
 0x1352,0x1364,0x0000, 0x1352,0x1365,0x0000, 0x1352,0x13B0,0x0000,
@@ -6006,7 +6005,7 @@ static const uint16 page0FCdata[]= { /*
 0x1381,0x13C8,0x0000, 0x1382,0x13C7,0x0000, 0x1382,0x13C8,0x0000,
 0x1364,0x13C7,0x0000 };
 
If we are not going to change the above, better to keep these as
const!

They will then be in the const segment and you will have protection
from anyone trying to change these!

<cut>

+my_uca_add_contraction(MY_CONTRACTIONS *list, my_wc_t *wc, size_t len,
+                       my_bool with_context)
 {
-  MY_CONTRACTIONS *list= (MY_CONTRACTIONS*) cs->contractions;
   MY_CONTRACTION *next= &list->item[list->nitems];
-  DBUG_ASSERT(len == 2); /* We currently support only contraction2 */
-  next->ch[0]= wc[0];
-  next->ch[1]= wc[1];
+  size_t i;
+  /*
+    Contraction is always at least 2 characters.
+    Contraction is never longer than MY_UCA_MAX_CONTRACTION,
+    which is guaranteed by using my_coll_rule_expand() with proper limit.
+  */
+  DBUG_ASSERT(len > 1 && len <= MY_UCA_MAX_CONTRACTION);
+  for (i= 0; i < len; i++)
+  {
+    /*
+      We don't support contractions with U+0000.
+      my_coll_rule_expand() guarantees there're no U+0000 in a contraction.
+    */
+    DBUG_ASSERT(wc[i] != 0);
+    next->ch[i]= wc[i];
+  }
+  if (i < MY_UCA_MAX_CONTRACTION)
+    next->ch[i]= 0; /* Add end-of-line marker */

Wouldn't it make life easier to always have the marker there?
(Ie, always have place for the marker).
At least you can remove one if from the code.  If there is more than
one if to remove, then it's a simple optimzation do to...

<cut>

+static int
+my_coll_parser_too_long_error(MY_COLL_RULE_PARSER *p, const char *name)
+{
+  my_snprintf(p->errstr, sizeof(p->errstr), "%s is too long", name);
+  return 0;
+}

You should limit '%s' to 64 characters so that one gets the name in
the output even if it's over sizeof()...

%-.64s ....

<cut>

Regards,
Monty


Follow ups