maria-developers team mailing list archive
-
maria-developers team
-
Mailing list archive
-
Message #10439
Re: MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
Hello,
On Wed, Mar 01, 2017 at 01:56:24PM +0100, Sergei Golubchik wrote:
> Hi, Alexander!
>
> On Feb 28, Alexander Barkov wrote:
>
> > commit af8887b86ccbaea8782cf54fe445cf53aaef7c06
> > Author: Alexander Barkov <bar@xxxxxxxxxxx>
> > Date: Tue Feb 28 10:28:09 2017 +0400
> >
> > MDEV-10306 Wrong results with combination of CONCAT, SUBSTR and CONVERT in subquery
> >
> > The bug happens because of a combination of unfortunate circumstances:
> >
> > 1. Arguments args[0] and args[2] of Item_func_concat point recursively
> > (through Item_direct_view_ref's) to the same Item_func_conv_charset.
> > Both args[0]->args[0]->ref[0] and args[2]->args[0]->ref[0] refer to
> > this Item_func_conv_charset.
> >
> > 2. When Item_func_concat::args[0]->val_str() is called,
> > Item_func_conv_charset::val_str() writes its result to
> > Item_func_conc_charset::tmp_value.
> >
> > 3. Then, for optimization purposes (to avoid copying),
> > Item_func_substr::val_str() initializes Item_func_substr::tmp_value
> > to point to the buffer fragment owned by Item_func_conv_charset::tmp_value
> > Item_func_substr::tmp_value is returned as a result of
> > Item_func_concat::args[0]->val_str().
> >
> > 4. Due to optimization to avoid memory reallocs,
> > Item_func_concat::val_str() remembers the result of args[0]->val_str()
> > in "res" and further uses "res" to collect the return value.
> >
> > 5. When Item_func_concat::args[2]->val_str() is called,
> > Item_func_conv_charset::tmp_value gets overwritten (see #1),
> > which effectively overwrites args[0]'s Item_func_substr::tmp_value (see #3),
> > which effectively overwrites "res" (see #4).
> >
> > The fix marks Item_func_substr::tmp_value as a constant string, which
> > tells Item_func_concat::val_str "Don't use me as the return value, you cannot
> > append to me because I'm pointing to a buffer owned by some other String".
>
> This pretty much looks like a hack, that makes the bug disappear in this
> particular test case.
>
> What if SUBSTR() wasn't used? CONCAT would still modify
> args[0]->tmp_value, and it would be overwritten by args[2]->val_str().
>
> On the other hand, if you remove args[2] from the test case, then CONCAT
> can safely modify args[0]'s buffer and marking SUBSTR as const would
> prevent a valid optimization.
>
> So, I see few possible approaches to this and other similar queries:
>
> 1. We specify that no Item's val method can modify the buffer of the
> arguments. That is, CONCAT will always have to copy. SUBSTR won't
> need to copy, because it doesn't modify the buffer, it only returns a
> pointer into it.
>
> 2. May be #1 is not strict enough, and we'll need to disallow pointers
> into the arguments' buffer too. Because, perhaps, args[2]->val_str()
> could realloc and then the pointer will become invalid.
>
> 3. A different approach would be to disallow one item to appear twice in
> an expression. No idea how to do that.
>
> 4. A variand of #3, an item can appear many times, but it'll be only
> evaluated once per row. That still needs #1, but #2 is unnecessary.
>
> Opinions? Ideas?
My take on the issue: I read the comment for Item::val_str:
The caller of this method can modify returned string, but only in case
when it was allocated on heap, (is_alloced() is true). This allows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^(*)
the caller to efficiently use a buffer allocated by a child without
having to allocate a buffer of it's own. The buffer, given to
val_str() as argument, belongs to the caller and is later used by the
caller at it's own choosing.
A few implications from the above:
- unless you return a string object which only points to your buffer
but doesn't manages it you should be ready that it will be
modified.
- even for not allocated strings (is_alloced() == false) the caller
can change charset (see Item_func_{typecast/binary}. XXX: is this
a bug?
- still you should try to minimize data copying and return internal
object whenever possible.
And I see that Item_func_concat::val_str violates the rule (*) here:
=> if (!is_const && res->alloced_length() >= res->length()+res2->length())
{ // Use old buffer
res->append(*res2);
Here I see that
(gdb) p *res
$81 = {Ptr = 0x7fffcb4203f0 "123---", str_length = 6, Alloced_length = 8,
extra_alloc = 0, alloced = false, thread_specific = false, str_charset =
0x175de80 <my_charset_latin1>}
(gdb) p res->is_alloced()
$83 = false
That is, the buffer is "not alloced", but the code modifies it by calling
append().
So, I can get this bug' testcase to pass with this patch:
diff --git a/sql/item_strfunc.cc b/sql/item_strfunc.cc
index 4ea3075..2e2cecd 100644
--- a/sql/item_strfunc.cc
+++ b/sql/item_strfunc.cc
@@ -671,7 +671,8 @@ String *Item_func_concat::val_str(String *str)
current_thd->variables.max_allowed_packet);
goto null;
}
- if (!is_const && res->alloced_length() >= res->length()+res2->length())
+ if (!is_const && res->alloced_length() >= res->length()+res2->length()
+ && res->is_alloced() /*psergey-added*/)
{ // Use old buffer
res->append(*res2);
}
In general, I think we need to settle on some variant of #2.
If that causes unnecessary data copying, perhaps we will need reference-counted
strings or something like that ( I could think more about it but prefer to
discuss this in real time in order not to duplicate the work).
BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog
Follow ups
References