← Back to team overview

maria-developers team mailing list archive

Re: MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery

 

Hello,

On Wed, Mar 01, 2017 at 01:56:24PM +0100, Sergei Golubchik wrote:
> Hi, Alexander!
> 
> On Feb 28, Alexander Barkov wrote:
> 
> > commit af8887b86ccbaea8782cf54fe445cf53aaef7c06
> > Author: Alexander Barkov <bar@xxxxxxxxxxx>
> > Date:   Tue Feb 28 10:28:09 2017 +0400
> > 
> >     MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
> >     
> >     The bug happens because of a combination of unfortunate circumstances:
> >     
> >     1. Arguments args[0] and args[2] of Item_func_concat point recursively
> >     (through Item_direct_view_ref's) to the same Item_func_conv_charset.
> >     Both args[0]->args[0]->ref[0] and args[2]->args[0]->ref[0] refer to
> >     this Item_func_conv_charset.
> >     
> >     2. When Item_func_concat::args[0]->val_str() is called,
> >     Item_func_conv_charset::val_str() writes its result to
> >     Item_func_conc_charset::tmp_value.
> >     
> >     3. Then, for optimization purposes (to avoid copying),
> >     Item_func_substr::val_str() initializes Item_func_substr::tmp_value
> >     to point to the buffer fragment owned by Item_func_conv_charset::tmp_value
> >     Item_func_substr::tmp_value is returned as a result of
> >     Item_func_concat::args[0]->val_str().
> >     
> >     4. Due to optimization to avoid memory reallocs,
> >     Item_func_concat::val_str() remembers the result of args[0]->val_str()
> >     in "res" and further uses "res" to collect the return value.
> >     
> >     5. When Item_func_concat::args[2]->val_str() is called,
> >     Item_func_conv_charset::tmp_value gets overwritten (see #1),
> >     which effectively overwrites args[0]'s Item_func_substr::tmp_value (see #3),
> >     which effectively overwrites "res" (see #4).
> >     
> >     The fix marks Item_func_substr::tmp_value as a constant string, which
> >     tells Item_func_concat::val_str "Don't use me as the return value, you cannot
> >     append to me because I'm pointing to a buffer owned by some other String".
> 
> This pretty much looks like a hack, that makes the bug disappear in this
> particular test case.
> 
> What if SUBSTR() wasn't used? CONCAT would still modify
> args[0]->tmp_value, and it would be overwritten by args[2]->val_str().
> 
> On the other hand, if you remove args[2] from the test case, then CONCAT
> can safely modify args[0]'s buffer and marking SUBSTR as const would
> prevent a valid optimization.
> 
> So, I see few possible approaches to this and other similar queries:
> 
> 1. We specify that no Item's val method can modify the buffer of the
>    arguments. That is, CONCAT will always have to copy. SUBSTR won't
>    need to copy, because it doesn't modify the buffer, it only returns a
>    pointer into it.
> 
> 2. May be #1 is not strict enough, and we'll need to disallow pointers
>    into the arguments' buffer too. Because, perhaps, args[2]->val_str()
>    could realloc and then the pointer will become invalid.
> 
> 3. A different approach would be to disallow one item to appear twice in
>    an expression. No idea how to do that.
> 
> 4. A variand of #3, an item can appear many times, but it'll be only
>    evaluated once per row. That still needs #1, but #2 is unnecessary.
> 
> Opinions? Ideas?

My take on the issue: I read the comment for Item::val_str:

      The caller of this method can modify returned string, but only in case
      when it was allocated on heap, (is_alloced() is true).  This allows
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^(*)
      the caller to efficiently use a buffer allocated by a child without
      having to allocate a buffer of it's own. The buffer, given to
      val_str() as argument, belongs to the caller and is later used by the
      caller at it's own choosing.
      A few implications from the above:
      - unless you return a string object which only points to your buffer
        but doesn't manages it you should be ready that it will be
        modified.
      - even for not allocated strings (is_alloced() == false) the caller
        can change charset (see Item_func_{typecast/binary}. XXX: is this
        a bug?
      - still you should try to minimize data copying and return internal
        object whenever possible.

And I see that Item_func_concat::val_str violates the rule (*) here:

=>    if (!is_const && res->alloced_length() >= res->length()+res2->length())
      {						// Use old buffer
	res->append(*res2);

Here I see that
(gdb) p *res
  $81 = {Ptr = 0x7fffcb4203f0 "123---", str_length = 6, Alloced_length = 8,
  extra_alloc = 0, alloced = false, thread_specific = false, str_charset =
  0x175de80 <my_charset_latin1>}

(gdb) p res->is_alloced()
  $83 = false

That is, the buffer is "not alloced", but the code modifies it by calling
append().

So, I can get this bug' testcase to pass with this patch:

diff --git a/sql/item_strfunc.cc b/sql/item_strfunc.cc
index 4ea3075..2e2cecd 100644
--- a/sql/item_strfunc.cc
+++ b/sql/item_strfunc.cc
@@ -671,7 +671,8 @@ String *Item_func_concat::val_str(String *str)
                            current_thd->variables.max_allowed_packet);
        goto null;
       }
-      if (!is_const && res->alloced_length() >= res->length()+res2->length())
+      if (!is_const && res->alloced_length() >= res->length()+res2->length()
+          && res->is_alloced() /*psergey-added*/)
       {                                                // Use old buffer
        res->append(*res2);
       }


In general, I think we need to settle on some variant of #2.
If that causes unnecessary data copying, perhaps we will need reference-counted
strings or something like that ( I could think more about it but prefer to
discuss this in real time in order not to duplicate the work).

BR
 Sergei
-- 
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog




Follow ups

References