← Back to team overview

maria-developers team mailing list archive

Re: MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery

 

Hello Serg and Sergey,

Serg, please review a new version of the patch.


Sergey, thanks for noticing the problem with a missing
res->is_alloced() part in the condition. This fixed the
original problem reported in the bug.


I also had to modify Item_func_conv_charset::val_str(),
to make this query work well:

SELECT CONCAT(t2,'-',t2) c2 FROM (SELECT CONVERT(t USING latin1) t2 FROM
t1) sub;

(it's reported in an additional comment in MDEV).

This additional change prevents Item_func_concat from modifying
then buffer owned by Item_func_conv_charset.


On 03/02/2017 01:38 PM, Sergey Petrunia wrote:
> Hello,
> 
> On Wed, Mar 01, 2017 at 01:56:24PM +0100, Sergei Golubchik wrote:
>> Hi, Alexander!
>>
>> On Feb 28, Alexander Barkov wrote:
>>
>>> commit af8887b86ccbaea8782cf54fe445cf53aaef7c06
>>> Author: Alexander Barkov <bar@xxxxxxxxxxx>
>>> Date:   Tue Feb 28 10:28:09 2017 +0400
>>>
>>>     MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
>>>     
>>>     The bug happens because of a combination of unfortunate circumstances:
>>>     
>>>     1. Arguments args[0] and args[2] of Item_func_concat point recursively
>>>     (through Item_direct_view_ref's) to the same Item_func_conv_charset.
>>>     Both args[0]->args[0]->ref[0] and args[2]->args[0]->ref[0] refer to
>>>     this Item_func_conv_charset.
>>>     
>>>     2. When Item_func_concat::args[0]->val_str() is called,
>>>     Item_func_conv_charset::val_str() writes its result to
>>>     Item_func_conc_charset::tmp_value.
>>>     
>>>     3. Then, for optimization purposes (to avoid copying),
>>>     Item_func_substr::val_str() initializes Item_func_substr::tmp_value
>>>     to point to the buffer fragment owned by Item_func_conv_charset::tmp_value
>>>     Item_func_substr::tmp_value is returned as a result of
>>>     Item_func_concat::args[0]->val_str().
>>>     
>>>     4. Due to optimization to avoid memory reallocs,
>>>     Item_func_concat::val_str() remembers the result of args[0]->val_str()
>>>     in "res" and further uses "res" to collect the return value.
>>>     
>>>     5. When Item_func_concat::args[2]->val_str() is called,
>>>     Item_func_conv_charset::tmp_value gets overwritten (see #1),
>>>     which effectively overwrites args[0]'s Item_func_substr::tmp_value (see #3),
>>>     which effectively overwrites "res" (see #4).
>>>     
>>>     The fix marks Item_func_substr::tmp_value as a constant string, which
>>>     tells Item_func_concat::val_str "Don't use me as the return value, you cannot
>>>     append to me because I'm pointing to a buffer owned by some other String".
>>
>> This pretty much looks like a hack, that makes the bug disappear in this
>> particular test case.
>>
>> What if SUBSTR() wasn't used? CONCAT would still modify
>> args[0]->tmp_value, and it would be overwritten by args[2]->val_str().
>>
>> On the other hand, if you remove args[2] from the test case, then CONCAT
>> can safely modify args[0]'s buffer and marking SUBSTR as const would
>> prevent a valid optimization.
>>
>> So, I see few possible approaches to this and other similar queries:
>>
>> 1. We specify that no Item's val method can modify the buffer of the
>>    arguments. That is, CONCAT will always have to copy. SUBSTR won't
>>    need to copy, because it doesn't modify the buffer, it only returns a
>>    pointer into it.
>>
>> 2. May be #1 is not strict enough, and we'll need to disallow pointers
>>    into the arguments' buffer too. Because, perhaps, args[2]->val_str()
>>    could realloc and then the pointer will become invalid.
>>
>> 3. A different approach would be to disallow one item to appear twice in
>>    an expression. No idea how to do that.
>>
>> 4. A variand of #3, an item can appear many times, but it'll be only
>>    evaluated once per row. That still needs #1, but #2 is unnecessary.
>>
>> Opinions? Ideas?
> 
> My take on the issue: I read the comment for Item::val_str:
> 
>       The caller of this method can modify returned string, but only in case
>       when it was allocated on heap, (is_alloced() is true).  This allows
>                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^(*)
>       the caller to efficiently use a buffer allocated by a child without
>       having to allocate a buffer of it's own. The buffer, given to
>       val_str() as argument, belongs to the caller and is later used by the
>       caller at it's own choosing.
>       A few implications from the above:
>       - unless you return a string object which only points to your buffer
>         but doesn't manages it you should be ready that it will be
>         modified.
>       - even for not allocated strings (is_alloced() == false) the caller
>         can change charset (see Item_func_{typecast/binary}. XXX: is this
>         a bug?
>       - still you should try to minimize data copying and return internal
>         object whenever possible.
> 
> And I see that Item_func_concat::val_str violates the rule (*) here:
> 
> =>    if (!is_const && res->alloced_length() >= res->length()+res2->length())
>       {						// Use old buffer
> 	res->append(*res2);
> 
> Here I see that
> (gdb) p *res
>   $81 = {Ptr = 0x7fffcb4203f0 "123---", str_length = 6, Alloced_length = 8,
>   extra_alloc = 0, alloced = false, thread_specific = false, str_charset =
>   0x175de80 <my_charset_latin1>}
> 
> (gdb) p res->is_alloced()
>   $83 = false
> 
> That is, the buffer is "not alloced", but the code modifies it by calling
> append().
> 
> So, I can get this bug' testcase to pass with this patch:
> 
> diff --git a/sql/item_strfunc.cc b/sql/item_strfunc.cc
> index 4ea3075..2e2cecd 100644
> --- a/sql/item_strfunc.cc
> +++ b/sql/item_strfunc.cc
> @@ -671,7 +671,8 @@ String *Item_func_concat::val_str(String *str)
>                             current_thd->variables.max_allowed_packet);
>         goto null;
>        }
> -      if (!is_const && res->alloced_length() >= res->length()+res2->length())
> +      if (!is_const && res->alloced_length() >= res->length()+res2->length()
> +          && res->is_alloced() /*psergey-added*/)
>        {                                                // Use old buffer
>         res->append(*res2);
>        }
> 
> 
> In general, I think we need to settle on some variant of #2.
> If that causes unnecessary data copying, perhaps we will need reference-counted
> strings or something like that ( I could think more about it but prefer to
> discuss this in real time in order not to duplicate the work).
> 
> BR
>  Sergei
> 
commit 504668f5dcdd91ca7dd537bdcb292e7173e4069b
Author: Alexander Barkov <bar@xxxxxxxxxxx>
Date:   Fri Mar 3 07:49:36 2017 +0400

    MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
    
    The bug happens because of a combination of unfortunate circumstances:
    
    1. Arguments args[0] and args[2] of Item_func_concat point recursively
    (through Item_direct_view_ref's) to the same Item_func_conv_charset.
    Both args[0]->args[0]->ref[0] and args[2]->args[0]->ref[0] refer to
    this Item_func_conv_charset.
    
    2. When Item_func_concat::args[0]->val_str() is called,
    Item_func_conv_charset::val_str() writes its result to
    Item_func_conc_charset::tmp_value.
    
    3. Then, for optimization purposes (to avoid copying),
    Item_func_substr::val_str() initializes Item_func_substr::tmp_value
    to point to the buffer fragment owned by Item_func_conv_charset::tmp_value
    Item_func_substr::tmp_value is returned as a result of
    Item_func_concat::args[0]->val_str().
    
    4. Due to optimization to avoid memory reallocs,
    Item_func_concat::val_str() remembers the result of args[0]->val_str()
    in "res" and further uses "res" to collect the return value.
    
    5. When Item_func_concat::args[2]->val_str() is called,
    Item_func_conv_charset::tmp_value gets overwritten (see #1),
    which effectively overwrites args[0]'s Item_func_substr::tmp_value (see #3),
    which effectively overwrites "res" (see #4).
    
    This patch does the following:
    
    - Adds an additional res->is_alloced() part to the condition
      that decides if res2 can be directly appended to res.
      Without this condition, Item_func_concat violated the rules about val_str().
      See the comments about Item::val_str() in item.h.
      Thanks for Sergey P. for proposing this solution.
      This change fixes the original reported problem (with SUBSTR).
      The result of SUBSTR() cannot be used to accumulate the CONCAT()
      result any more, because it's not alloced.
    
    - Changes Item_func_conv_charset::val_str(String *str) to use
      tmp_value and str the other way around. After this change tmp_value
      is used to store a temporary result, while str is used to return the value.
      The fixes the second problem (without SUBSTR):
        SELECT CONCAT(t2,'-',t2) c2
          FROM (SELECT CONVERT(t USING latin1) t2 FROM t1) sub;
      As Item_func_concat::val_str() supplies two different buffers when calling
      args[0]->val_str() and args[2]->val_str(), in the new reduction the result
      created during args[0]->val_str() does not get overwritten by
      args[2]->val_str().

diff --git a/mysql-test/r/func_concat.result b/mysql-test/r/func_concat.result
index 925158a..4033195 100644
--- a/mysql-test/r/func_concat.result
+++ b/mysql-test/r/func_concat.result
@@ -149,3 +149,23 @@ CALL p1();
 ########################################40100.000
 DROP PROCEDURE p1;
 # End of 5.1 tests
+#
+# Start of 10.0 tests
+#
+#
+# MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
+#
+SET @save_optimizer_switch=@@optimizer_switch;
+SET optimizer_switch='derived_merge=on';
+CREATE TABLE t1 (t VARCHAR(10) CHARSET latin1);
+INSERT INTO t1 VALUES('1234567');
+SELECT CONCAT(SUBSTR(t2, 1, 3), SUBSTR(t2, 5)) c1,
+CONCAT(SUBSTR(t2,1,3),'---',SUBSTR(t2,5)) c2
+FROM (SELECT CONVERT(t USING latin1) t2 FROM t1) sub;
+c1	c2
+123567	123---567
+SELECT CONCAT(t2,'-',t2) c2 FROM (SELECT CONVERT(t USING latin1) t2 FROM t1) sub;
+c2
+1234567-1234567
+DROP TABLE t1;
+SET optimizer_switch=@save_optimizer_switch;
diff --git a/mysql-test/t/func_concat.test b/mysql-test/t/func_concat.test
index e56d112..e84ccbf 100644
--- a/mysql-test/t/func_concat.test
+++ b/mysql-test/t/func_concat.test
@@ -145,3 +145,24 @@ CALL p1();
 DROP PROCEDURE p1;
 
 --echo # End of 5.1 tests
+
+
+--echo #
+--echo # Start of 10.0 tests
+--echo #
+
+--echo #
+--echo # MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
+--echo #
+
+SET @save_optimizer_switch=@@optimizer_switch;
+SET optimizer_switch='derived_merge=on';
+
+CREATE TABLE t1 (t VARCHAR(10) CHARSET latin1);
+INSERT INTO t1 VALUES('1234567');
+SELECT CONCAT(SUBSTR(t2, 1, 3), SUBSTR(t2, 5)) c1,
+       CONCAT(SUBSTR(t2,1,3),'---',SUBSTR(t2,5)) c2
+   FROM (SELECT CONVERT(t USING latin1) t2 FROM t1) sub;
+SELECT CONCAT(t2,'-',t2) c2 FROM (SELECT CONVERT(t USING latin1) t2 FROM t1) sub;
+DROP TABLE t1;
+SET optimizer_switch=@save_optimizer_switch;
diff --git a/sql/item_strfunc.cc b/sql/item_strfunc.cc
index 4ea3075..f5a925f 100644
--- a/sql/item_strfunc.cc
+++ b/sql/item_strfunc.cc
@@ -671,7 +671,8 @@ String *Item_func_concat::val_str(String *str)
 			    current_thd->variables.max_allowed_packet);
 	goto null;
       }
-      if (!is_const && res->alloced_length() >= res->length()+res2->length())
+      if (!is_const && res->alloced_length() >= res->length()+res2->length() &&
+          res->is_alloced() /*psergey-added*/)
       {						// Use old buffer
 	res->append(*res2);
       }
@@ -3378,16 +3379,16 @@ String *Item_func_conv_charset::val_str(String *str)
   DBUG_ASSERT(fixed == 1);
   if (use_cached_value)
     return null_value ? 0 : &str_value;
-  String *arg= args[0]->val_str(str);
+  String *arg= args[0]->val_str(&tmp_value);
   uint dummy_errors;
   if (args[0]->null_value)
   {
     null_value=1;
     return 0;
   }
-  null_value= tmp_value.copy(arg->ptr(), arg->length(), arg->charset(),
-                             conv_charset, &dummy_errors);
-  return null_value ? 0 : check_well_formed_result(&tmp_value);
+  null_value= str->copy(arg->ptr(), arg->length(), arg->charset(),
+                        conv_charset, &dummy_errors);
+  return null_value ? 0 : check_well_formed_result(str);
 }
 
 void Item_func_conv_charset::fix_length_and_dec()

Follow ups

References