maria-developers team mailing list archive

Thread
Date

Re: Review: bb-10.8-wlad-MDEV-26713

To: Vladislav Vaintroub <wlad@xxxxxxxxxxx>, maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
From: Alexander Barkov <bar@xxxxxxxxxxx>
Date: Mon, 13 Dec 2021 10:44:46 +0400
In-reply-to: <11789FFB-D840-4A38-B70B-980F6189CE40@hxcore.ol>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0

  Hi Wlad,


On 12/13/21 2:15 AM, Vladislav Vaintroub wrote:

Hi Bar,
Thanks for review, much appreciated! I addressed most of the comments inthe new pushhttps://github.com/MariaDB/server/compare/bb-10.8-wlad-MDEV-26713<https://github.com/MariaDB/server/compare/bb-10.8-wlad-MDEV-26713>
commit 1411d5f40ceaf2eecabbcf68588743439003f4a2
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Fri Nov 19 11:42:46 2021 +0100
+  wchar_t wbuf[80];
+  char to[80*3]; /* there is at most 3 bytes per wchar, in any codepage */
+  wchar_t *pos=wbuf,*end=wbuf + array_elements(wbuf)-1;
Isn't it "at most 4 bytes per char"?
You’re right. I missed GB18030, single UTF16 wchar_t can expand to fourGB18030 bytes . But for UTF8, single UTF16 wchar_t can’t expand to 4bytes, if you mean that?



- One wchar_t on Windows can encode U+0000..U+FFFF (BMP) only,
  so it produces not more than 3 bytes in UTF8 indeed.

- A 4-byte UTF8 (SMP) can be produced from two wchar_t's only
  (a high+low surrogate pair).

- All multi-byte character sets (except UTF-32 and GB18030)
  can produce 3 bytes per character at most.

- GB18030 can produce 4-byte sequences from BMP characters,
  i.e. from a single wchar_t.
  What will happen with this code if one starts a client
  program with GB18030 as a console code page?
  The code that we're discussing here uses GetConsoleCP(),
  so I guess it CAN be GB18030, even though we don't support it yet.

- UTF-32 produces 4 byte characters.
  What will happen if one starts a client program
  with UTF-32 as a console codepage?


Also, please consider adding a comment.
It's not obvious where 3 comes from, especially for a Linux
addicted reader where a single wchar_t encode the entire Unicode range.

#############################################################

commit 7af691ca11c95a385aff4889a56870d70843e78d

Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >

-  pthread_auto_mutex_decl(my_conio_cs);

...

There was a mutex lock inside my_cgets().

Why was it needed before the patch and why is it not needed any more?

The old code did more than it had to. We do not need mutex locking forreading line from console window, and we do not use threads in commandline client.


Right. The old code seemed to be redundant.
Strange, why this lock was there.

########################################################
commit b9c38470ed6e04185f7f84a056034b7254336d51
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Mon Nov 22 13:22:05 2021 +0100
......
Should not we add an UTF8 BOM to the beginning of the file,
so text editors do not break it later.
It might render it unusable for our own stuff, see MySQL Bugs: #67880:mysql 5.6.9 on windows does not start if my.ini file is in utf8 format<https://bugs.mysql.com/bug.php?id=67880>
I see no reason for having BOM now. Windows’ own editors, Notepad bydefaults stores in BOM-less UTF-8 (on systems we target with this workat last), VS Code, Notepad++ . Some Linux tools however would haveproblems with BOM, in particular “cat” . Anyway, its been a long whilesince I’ve seen UTF8 being not correctly detected and handled by any editor.


Okey, let's go without BOM. Our code is not always BOM friendly indeed.

########################################################
commit 51ce069376fec45d340563a4f7630f8a155afd64
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Fri Nov 26 18:41:35 2021 +0100
MDEV-26713 Introduce --password-codepage client setting
   I'm not sure we need this new options.
   Looks like over-engineering a bit.
Ok,I agree new parameter is not nice. And, found a way to handle thatwithout , using --default-character-set MDEV-26713 allow users withnon-UTF8 passwords to login after upgrade. · MariaDB/server@b45c887(github.com)<https://github.com/MariaDB/server/commit/b45c8872742221b921a02118fcd33281925dbf83>
I’m firmly convinced, that if we are about to break connectivity, and weknow about it, we must to offer something else than “contact youradministrator”. That administrator could be the guy who implementedpassword validation plugin to force at last one non-ASCII character (itis safer, say some people on the internet, harder to break forhackers), and now that poor admin is busy fixing a hundreds of “resetmy password” requests. Who knows, I do not think this scenario isentirely unrealistic.
We have no clue how much impact breaking non-ASCII passwords would be.We never warned users, that such passwords would not work, or are “notportable” .It is impossible to run an automated check, whether suchpasswords were used, prior to upgrade, as we never stored the actualpassword.
At the end, users are not guilty in the mess we created ourselves byhandling passwords as bytes, rather than strings. Thus user should notbe punished for something he had not broke. Instead, an explanation,and an easy workaround that allows to connect after upgrade, must be inrelease notes.


I agree and I like that you found a way to do it without a new option.

#####################################################
commit 5c09e9ce4f922fe9674f3f85329742c0c5498a28
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Fri Nov 19 14:03:51 2021 +0100
MDEV-27092 Windows - services that have non-ASCII characters do not
  >work with activeCodePage=UTF8
This code:
+  AWSTRDUP(w_ServiceName,lpServiceName);
    ....
duplicates two times. Consider wrapping these variables into a class
and reuse the class in both cases.
It duplicates because those functions are a trivial wrappers for 2 Cfunctions, with similar, but not exact the same signature I consideredwrapping, a C++ class, spent some time trying. It turned out lookingtoo artificial, and added more code actually that it removed, which mademe believe there is no good abstraction for this case. I’m leaving it as-is.


Okey.

########################################################
commit b155c4036cfb3ffdaa7c272048ac9d32e3182d78
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Mon Nov 29 19:47:36 2021 +0100
MDEV-26713 set console codepage to what user set in
  >--default-character-set
...
+static void adjust_console_codepage()
...<skip>...
Also, I think this function should only be called when
default_charset was some explicit name, not MYSQL_AUTODETECT_CHARSET_NAME.
I think the logic should be:
1. If --default-character-set=auto, then detect it from the console.
2. If --default-character-set=csname, then set the console CP to csname.
Now in case of N1 it seems to detect the character set from the console,
then sets the console to what has been detected.
Is the last step really needed in case of auto-detection?
It seems like it would not be needed. However our autodetection isloose, and this has some implications :
Let's say the console codepage is cp437. And we got some old Windows,pre Win10 1903 .
cp437 would be autodetected as MySQL cp850. There are 46 codepointpoint differences between cp437 and cp850, (s.https://en.wikipedia.org/wiki/Code_page_850, differences marked yellow),so those codepages are not really very similar, differing in ~20% ofthe code points.
In the logic I used, console would be is fixed from 437 to 850, thusshowing French accented characters for French accented characters indata, rather than random Greek.
So, I’d think last step was fine.


Thanks. This explains. In case of a non-exact detection it's good idea
to adjust the console. Can you please add a comment to the code?

########################################################
commit b38c341cc392630c798eb754f87338d37ec95cdf
Author: Vladislav Vaintroub <wlad@xxxxxxxxxxx >
Date:   Sun Dec 5 17:08:18 2021 +0100
Fix race condition in test.
Can you extend the comment and tell what actually happened
before this change?
I rebased into into some other commit, but so I'll describe here whathappened here.
On the “new” buildbot, Windows host runs mysql-test-run.pl --parallel=64. It is an “old” version of Windows in terms of this work, Windows 10build 1803, so it uses legacy codepage detection via console CP. And itruns a test client_charset_win.test, which is
chcp 1257 > NULL & $MYSQL --default-character-set=auto -e "select@@character_set_client"
The machine is US English Windows, with default console codepage is 437. However test would sometimes, actually most of the times, show cp850instead of expected cp1257
What happens here, there is that of builbot slave runs perl (MTR), andall MTR subprocesses share the same console object, the one of pythonprocess, which is builbot slave.
Now, none of the subprocesses processes would be using console, andthere is no visible window even. Output is redirected for allprocesses, as well as input, where appropriate. Yes, one of theclients managed to change the shared console codepage to 850, in thetiny time window between chcp and the client startup . I’d guess thiswas fixing of loosely autodetected cp850 on cp437, as I described earlier.
So the fix is avoid changing console codepage, unless there is an actualconsole output.


Thank for the explanation. I still have some questions. Let's
talk on slack.

Best regards,

Wlad

*From: *Alexander Barkov <mailto:bar@xxxxxxxxxxx>
*Sent: *Friday, 10 December 2021 14:40
*To: *maria-developers <mailto:maria-developers@xxxxxxxxxxxxxxxxxxx>;Vladislav Vaintroub <mailto:wlad@xxxxxxxxxxx>
*Subject: *Review: bb-10.8-wlad-MDEV-26713

Hi Wlad,

You've done a great piece of work. Thanks and congrats.

..

References

Review: bb-10.8-wlad-MDEV-26713
From: Alexander Barkov, 2021-12-10
Re: Review: bb-10.8-wlad-MDEV-26713
From: Vladislav Vaintroub, 2021-12-12