← Back to team overview

maria-developers team mailing list archive

Re: Please review MDEV-8092 Change Create_field::field from "const char *" to LEX_CSTRING

 

Hi Sergei,


On 01/09/2016 11:54 PM, Sergei Golubchik wrote:
Hi, Alexander!

On Jan 09, Alexander Barkov wrote:
I disagree to go the easier way with LEX_STRING.  LEX_STRING is a
very unfortunately designed thing for the server side code.

It was a kind of Ok on 32 bit platforms.
But now on 64bit platforms it's just a waste of memory.
Length can never be large enough to use the entire "size_t".

I wouldn't sweat over it. No matter what you do with LEX_STRING it won't
decrease server memory requirements in any noticeable way.

Monty proposes to collect some of the bool flags in Items into
bit mask, to save a few bytes, because he thinks it's important.
You propose to lose 4 or 6 bytes, because it does not affect
server memory in any noticeable ways and thus is not important.

I agree with Monty here.
Saving a few bytes is at least not harmful and should affect
performance positively. A little bit, but still positively.

I am not afraid of having more string classes.

It is harmful in a sense that it adds more concepts to the source code
and a developer needs to keep them in mind when writing new code.
These new string classes make code base more complex. Sometimes the
added complexity is unavoidable. Sometimes it's unnecessary.

Anyway, you say that this change improves the performance? I don't think
it does. May even make the performance worse - access to unaligned
memory can be slower (presuming you've packed your data, otherwise the
compiler can align them, which would destroy all your "space savings").
So, the final proof is to benchmark it and see.

Attached.

I run the test as follows:

g++ -O3 s.cc && ./a.out


Ignore the first iteration
(I separated it from the other iterations with "------"),
as it warms up the processor, so the very first result is always
the worst.


As you can see, performance is the same. But:

- The size of the one with "unsigned int length" is 16.
- The size of the one with "size_t length" is 24.

Note, I did not use any packing.
The compiler is smart enough to use the unused 4 bytes to store the
extra members of the child class when inheriting,
like m_repertoire in this example.



When column metadata is sent to the client side, there is a character
conversion from utf8 to character_set_client.
If we remember the column name metadata inside the Name class,
we can save on conversion and use faster non-converting methods instead.

What kind of metadata? The repertoire?

Yes, at least repertoire. But maybe something else in the future.

For example, store short identifiers directly in the memory used by "const char *ptr", so it does not even need any buffer on memroot
(or elsewhere). If we do this, there will be a need for a flag telling
that there is no an external buffer when the data is inside.

Btw, the same thing could be useful in String.

IIRC, some OSes do this trick with std::string.


I've checked, LEX_STRING is only used in the plugin API that is
versioned and we can safely change LEX_STRING in 10.2 to use, say,
uint16, if needed.
I propose not to change LEX_STRING.

Fine :)

Moreover, it's not needed for server.
Only needed for the plugin API.


I think that:

1. Trying to reuse classes that are not fully suitable for a task is
more harmful than having more classes, because this is simply
not efficient.

2. Nobody objects to have many Item or Field classes that suite
a particular functionality better. But some people are very afraid
of having more strings. Why?
Various strings that we have in the server code *are* different enough.
Different string do deserve different classes.

Precisely. So, to know what different strings are there in the server, I
want to list them all. Then we'll see whether how string classes fit in.

1. char* - zero terminated string. I don't see why one should ever use
     it (in the server).
2. uchar* - byte array, not zero terminated, no length. Even worse.
     Never ever.
3. LEX_STRING - a string and a length, no charset. Used in C with a
     preallocated buffer.
4. LEX_CSTRING - constant string, no charset.
5. LEX_CUSTRING - constant byte array? weird, but possible.
6. DYNAMIC_STRING - C string allocated on the heap. Can be appended to,
     reallocates automatically. No charset.
7. CSET_STRING - C++ LEX_STRING with a charset.
8. String - swiss army knife C++ string. Uses a preallocated buffer or
     heap. Grows, shrinks, convert charsets, can copy, move, initialize
     from anything, print hex, make coffee, escape, append anything, and
     so on.
9. StringBuffer - helper template to initialize a String from a local
     fixed buffer (very common usage pattern).

And I'm pretty sure I've missed a few.
Two questions:

- what did I miss?

10. ConnectSE strings, which re-implement String with a different
allocator, but otherwise repeat String functionality.

Ok, so you're basically saying that String is not generic enough and
sometimes one needs a String that uses a different allocator.
I agree with that.

Excellent.

Now guess how happier Olivier and I could be if you
said this in October 2014, when I sent the patch implementing this :)
https://mariadb.atlassian.net/browse/MDEV-7063


11. There are also places where we have just pairs

    const char *name;
    uint/size_t name_length;

inside bigger structures.

That's basically my #1, we should never do it.
They should be replaced with LEX_STRINGs asap.

Right, or to some class LEX_STRING-alike class.
(but let's wait for what we will end up with Name).


- how do your new classes fits into it?

They do not fit.

Various strings have these properties:
1. maximum possible length
2. pointer signness
3. "const" qualifier in the pointer
4. memory allocation/realloction possibility.
5. Needed in the pure C API

Right. So

LEX_STRING:     signed/noconst/no/C
LEX_CSTRING:    signed/const/no/C
LEX_CUSTRING:   unsigned/const/no/C
DYNAMIC_STRING: signed/noconst/yes/C
CSET_STRING:    signed/noconst/no/C++
String:         signed/noconst/yes/C++

hmm, apparently, you've forgot

    6. charset and other metadata

last two classes have that, other do not.
And, as you can see, I've omitted "maximum possible length" part, as I
don't think it matters much.

It matters only for the proper type for the "length" member.


A column name is a string with these properties:
1. Not longer than 64K (perhaps even not longer than 255)
2. Signed pointer
3. Constant pointer
4. No [re]allocation
5. Not needed in C API

Non of the currently existing classes or structures perfectly fit.

it's signed/const/no/C++ which is pretty much LEX_CSTRING, just as
you've originally had in MDEV. If you add

   6. knows its charset (and repertoire)

then it's CSET_STRING or a (new) CSET_CSTRING, if you want the 'const'
qualifier (although, I suspect, that const CSET_STRING might work too).

Name knows its charset. But it's always utf8.
So no needs for a CHARSET_INFO pointer.

But having repertoire could help to avoid conversion for pure ASCII
identifiers (which is the majority).


please add 10, 11, etc to the list as you see fit.

12. Name, with "uint16 length" :)

I do not think we need it.

What about "uint length"? I.e. 4 bytes for length,
thus save 4 bytes for something extra, like repertoire
and other flags.



Note, some of these strings could share some code if:

1. We use a template, with abstract pointer and length types.
So we can pass say "const char *" as PTR and uint16 as LENGTH,
to get Name.
String, Name, CSET_STRING can have a common template root.

Yes.

2. We accept:

MDEV-7063 Split String into logical components

Unfortunately, nobody was willing to review this patch when
I tried to help Olivier not to re-implement String.

This is assigned to Monty now.

This is very very very pity and discouraging. I'm crying
in the nights and eat anti-depressants :)
Well, not really yet, but getting very close to this :)

Sorry :(

Let's speed up the discussion and take this string-class question on the
next maria call?

Well, at this point I'd prefer to postpone
"MDEV-7063 Split String into logical components" because
Olivier has already copied everything he needed from String
to his own class. So this is not urgent now.
But I'd like to return to it at some point.
Now we need to decide about Name, as it is really urgent.

Thanks.


Regards,
Sergei
Chief Architect MariaDB
and security@xxxxxxxxxxx

#include <stdio.h>
#include <sys/time.h>

static unsigned long long gettimer()
{
  struct timeval tm;
  return gettimeofday(&tm, NULL) ? 0 : tm.tv_sec * 1000000 + tm.tv_usec;
}


template <class LENGTH> class STR
{
  const char *m_ptr;
  LENGTH m_length;
public:
  STR(const char *str, LENGTH length)
   :m_ptr(str), m_length(length)
  { }
  STR(const volatile STR &other)
   :m_ptr(other.m_ptr), m_length(other.m_length)
  { }
  const char *ptr() volatile const { return m_ptr; }
  LENGTH length() volatile const { return m_length; }
};


class S8: public STR<size_t>
{
  unsigned int m_repertoire:1;
public:
  S8(const char *str, size_t length)
   :STR(str, length),
    m_repertoire(0)
  { }
  S8(const volatile S8 &other)
   :STR(other),
   m_repertoire(other.m_repertoire)
  { }
};


class S4: public STR<unsigned int>
{
  unsigned int m_repertoire:1;
public:
  S4(const char *str, size_t length)
    :STR(str, length),
    m_repertoire(0)
  { }
  S4(const volatile S4 &other)
   :STR(other),
    m_repertoire(other.m_repertoire)
  { }
};


#define NTIMES 100000000000ULL

template <class S> void test()
{
  volatile S str("aaa", 3);
  volatile size_t count= 0;
  unsigned long long timer;
  printf("sizeof()=%zu\n", sizeof(str));

  printf("  Calling length()                : ");
  timer= gettimer();
  for (size_t i= 0; i < NTIMES ; i++)
  {
    if (str.length() > 3)
      count++;
  }
  printf("  %f\n", (double) (gettimer() - timer) / 1000000e0);

  printf("  Calling constructor and length(): ");
  timer= gettimer();
  for (size_t i= 0; i < NTIMES; i++)
  {
    volatile S str1(str);
    if (str1.length() > 3)
      count++;
  }
  printf("  %f\n", (double) (gettimer() - timer) / 1000000e0);
  if (count > 0)
    printf("  Count= %zu\n", count);
}


int main()
{
  for (int i= 0; i < 8;  i++)
  {
    test<S4>();
    test<S8>();
    printf("\n");
  }
  return 0;
}
sizeof()=16
  Calling length()                :   51.708344
  Calling constructor and length():   77.398307
sizeof()=24
  Calling length()                :   51.860190
  Calling constructor and length():   77.198848

-----------------------------------------------

sizeof()=16
  Calling length()                :   51.959604 -
  Calling constructor and length():   77.119053 +
sizeof()=24
  Calling length()                :   51.595511
  Calling constructor and length():   77.229338

sizeof()=16
  Calling length()                :   51.401640 +
  Calling constructor and length():   77.262307 +
sizeof()=24
  Calling length()                :   51.998148
  Calling constructor and length():   77.298949

sizeof()=16
  Calling length()                :   51.425688
  Calling constructor and length():   77.337811
sizeof()=24
  Calling length()                :   51.805097
  Calling constructor and length():   77.275466

sizeof()=16
  Calling length()                :   51.481474
  Calling constructor and length():   78.011199
sizeof()=24
  Calling length()                :   51.783683
  Calling constructor and length():   77.411159

sizeof()=16
  Calling length()                :   51.472335
  Calling constructor and length():   77.337360
sizeof()=24
  Calling length()                :   51.496041
  Calling constructor and length():   77.285623

sizeof()=16
  Calling length()                :   51.949092
  Calling constructor and length():   77.201714
sizeof()=24
  Calling length()                :   51.836312
  Calling constructor and length():   77.150913

sizeof()=16
  Calling length()                :   52.105839
  Calling constructor and length():   77.193530 
sizeof()=24
  Calling length()                :   51.559920
  Calling constructor and length():   77.191390

References