← Back to team overview

maria-developers team mailing list archive

Re: Please review MDEV-8092 Change Create_field::field from "const char *" to LEX_CSTRING

 

Hi Sergei,


On 01/08/2016 03:48 PM, Sergei Golubchik wrote:
Hi, Alexander!

On Jan 08, Alexander Barkov wrote:
diff --git a/sql/field.h b/sql/field.h
index 0591914..a40e307 100644
--- a/sql/field.h
+++ b/sql/field.h
@@ -537,6 +537,115 @@ inline bool is_temporal_type_with_time(enum_field_types type)
   }


+/**
+  @class Name
+  An arbitrary SQL name, consisting of a NULL-terminated string
+  and its length.
+  Later it can be reused for table and schema names.
+*/
+class Name
+{
+  const char *m_str;
+  uint m_length;

Sorry, we have too many "string" classes already.
Either use LEX_STRING or create a Lex_string
which is inherited from LEX_STRING, binary compatible with it
and only adds C++ methods (and can be casted back and forth automatically).
So that one could use Lex_string where a LEX_STRING is expected and vice
versa

Sorry for a long reply.

This is a very convenient moment of time when
we can improve a few things.

I agree with this, to a certain extent. Cleanups and refactoring is
good. But I'm starting to feel that you do too many too big changes and
that makes the task MDEV much larger than it was supposed to be. And
much delayed too.

Wouldn't it be a good time now to actually start implementing a feature?

I disagree to go the easier way with LEX_STRING.
LEX_STRING is a very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms.
But now on 64bit platforms it's just a waste of memory.
Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't
decrease server memory requirements in any noticeable way.

Monty proposes to collect some of the bool flags in Items into
bit mask, to save a few bytes, because he thinks it's important.
You propose to lose 4 or 6 bytes, because it does not affect
server memory in any noticeable ways and thus is not important.

I agree with Monty here.
Saving a few bytes is at least not harmful and should affect
performance positively. A little bit, but still positively.

I am not afraid of having more string classes.



Btw, it can even never be long enough to use "uint". It should be enough
to use uint16 for length and use the available remainder (6 bytes) to
store extra useful things, like string metadata.
It will help to avoid some operations, like character set conversion
in the client-server protocol, and I planned to add some optimization
in the future.

I could say that one "should not use LEX_STRING where character sets are
involved", but let's postpone that till the end of the email.

When column metadata is sent to the client side, there is a character
conversion from utf8 to character_set_client.
If we remember the column name metadata inside the Name class,
we can save on conversion and use faster non-converting methods instead.



I propose to do the other way around:
- gradually get rid of using LEX_STRING and LEX_CSTRING for names
in the server side and use more suitable Name-alike classes instead.

More suitable for *what*? See below.


More suitable to store names:
- store lengths, which never need more than 2 bytes, in something smaller than size_t
- use the remaining unaligned memory for metadata in the future
(this can be done after pluggable data types)


- keep LEX_XXX only for compatibility with the existing pure C API,
    where it already presents now.

I've checked, LEX_STRING is only used in the plugin API that is
versioned and we can safely change LEX_STRING in 10.2 to use, say,
uint16, if needed.

LEX_STRING is not always used for names.
It is also used to store the entire query, which is limited
by max_allowed_packed and can be up to 1G.

So there should be two separate things, with
- uint16 length, for names.
- uint32 length, for longer entities, like the entire query.
We currently do not seem to need size_t for lengths.

But in the 64-bit world it's very likely that we'll introduce >4Gb blobs
sooner or later, and will support >4Gb queries.

I propose not to change LEX_STRING.



As for "too many string classes", I have heard this a few times already,
both in MariaDB and earlier in Oracle :)

Perhaps, because it's true? :)


Sorry, I'll stay on my point of view here.
Let's measure the code not by amount of classes,
but in how these classes fit to store the data.

>> I don't agree with this statement.

I think that:

1. Trying to reuse classes that are not fully suitable for a task is
more harmful than having more classes, because this is simply
not efficient.

2. Nobody objects to have many Item or Field classes that suite
a particular functionality better. But some people are very afraid
of having more strings. Why?
Various strings that we have in the server code *are* different enough.
Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I
want to list them all. Then we'll see whether how string classes fit in.

1. char* - zero terminated string. I don't see why one should ever use
    it (in the server).
2. uchar* - byte array, not zero terminated, no length. Even worse.
    Never ever.
3. LEX_STRING - a string and a length, no charset. Used in C with a
    preallocated buffer.
4. LEX_CSTRING - constant string, no charset.
5. LEX_CUSTRING - constant byte array? weird, but possible.
6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to,
    reallocates automatically. No charset.
7. CSET_STRING - C++ LEX_STRING with a charset.
8. String - swiss army knife C++ string. Uses a preallocated buffer or
    heap. Grows, shrinks, convert charsets, can copy, move, initialize
    from anything, print hex, make coffee, escape, append anything, and
    so on.
9. StringBuffer - helper template to initialize a String from a local
    fixed buffer (very common usage pattern).

And I'm pretty sure I've missed a few.
Two questions:

- what did I miss?

10. ConnectSE strings, which re-implement String with a different allocator, but otherwise repeat String functionality.

11. There are also places where we have just pairs

  const char *name;
  uint/size_t name_length;

inside bigger structures.


- how do your new classes fits into it?


They do not fit.

Various strings have these properties:
1. maximum possible length
2. pointer signness
3. "const" qualifier in the pointer
4. memory allocation/realloction possibility.
5. Needed in the pure C API


A column name is a string with these properties:
1. Not longer than 64K (perhaps even not longer than 255)
2. Signed pointer
3. Constant pointer
4. No [re]allocation
5. Not needed in C API

Non of the currently existing classes or structures perfectly fit.


please add 10, 11, etc to the list as you see fit.


12. Name, with "uint16 length" :)



Note, some of these strings could share some code if:

1. We use a template, with abstract pointer and length types.
So we can pass say "const char *" as PTR and uint16 as LENGTH,
to get Name.
String, Name, CSET_STRING can have a common template root.


2. We accept:

MDEV-7063 Split String into logical components

Unfortunately, nobody was willing to review this patch when
I tried to help Olivier not to re-implement String.

This is very very very pity and discouraging. I'm crying
in the nights and eat anti-depressants :)
Well, not really yet, but getting very close to this :)



this is logically wrong.
A field *has* a name, that's right. But a field *is not* a name.
The Field class should include the name, not be inherited from it.

Can I spend a few more seconds of you time, to have some fun?

This reminds me the phrase:

A rabbit IS not only costly fur, but IS also 3-4 kilograms of dietetic
meat.

Looks like this phrase is also logically wrong.
Or should I rather say it *has* a logically wrong phrasing? :)

Yeah, human languages are often (most of the time?) use words loosely
and rely on implicit meaning. Like "can you pass the salt, please?" -
the expected answer is to pass the salt, not to say "yes, I can".
(and, strictly speaking, human languages don't use words and don't rely
on the meaning. Humans do :)

Regards,
Sergei
Chief Architect MariaDB
and security@xxxxxxxxxxx



Follow ups

References