linuxdcpp-team team mailing list archive

[Bug 1715870] [NEW] Text::utf8ToWc and Text::wcToUtf8 interfaces/declarations incorrectly assume wchar_t on Win32 can represent any Unicode codepoint

To: linuxdcpp-team@xxxxxxxxxxxxxxxxxxx
From: cologic <1715870@xxxxxxxxxxxxxxxxxx>
Date: Fri, 08 Sep 2017 13:07:43 -0000
Reply-to: Bug 1715870 <1715870@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

int utf8ToWc(const char* str, wchar_t& c);
void wcToUtf8(wchar_t c, string& str);

Both assume that every relevant Unicode codepoint can be represented as
one wchar_t. This is not the case.

On at least some Win32 platforms, wchar_t is 16 bits wide
(sizeof(wchar_t) == 2), as documented by
https://msdn.microsoft.com/en-us/library/gg269344%28v=exchg.10%29.aspx
and
https://msdn.microsoft.com/en-us/library/windows/desktop/aa367308(v=vs.85).aspx
among other MSDN pages. That is not enough for, e.g., many of the emoji
that https://apps.timwhitlock.info/emoji/tables/unicode lists.
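
To make the size problem concrete, here is a minimal sketch. It is not
DC++ code: the names kMaxBmp and fitsInOneUtf16Unit are made up for
illustration, and std::uint32_t stands in for "any Unicode code point".
It only checks whether a code point can be held by one 16-bit code unit.

```cpp
#include <cstdint>

// Largest value a single 16-bit code unit can hold (end of the BMP).
constexpr std::uint32_t kMaxBmp = 0xFFFF;

// Hypothetical helper: true if the code point fits in one 16-bit unit,
// i.e. it could be stored in a single wchar_t where sizeof(wchar_t) == 2.
constexpr bool fitsInOneUtf16Unit(std::uint32_t cp) {
    return cp <= kMaxBmp;
}

// U+20AC EURO SIGN lies in the BMP, so one 16-bit wchar_t is enough.
static_assert(fitsInOneUtf16Unit(0x20AC), "BMP code point fits");

// U+1F600 GRINNING FACE lies beyond the BMP and cannot be stored in
// one 16-bit wchar_t, which is what the current signatures require.
static_assert(!fitsInOneUtf16Unit(0x1F600), "supplementary code point does not fit");
```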

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx describes how:
==================================================
Windows applications normally use UTF-16 to represent Unicode character data. The use of 16 bits allows direct representation of 65,536 unique characters, but this Basic Multilingual Plane (BMP) is not nearly enough to cover all the symbols used in human languages. Unicode version 4.1 includes over 97,000 characters, with over 70,000 characters for Chinese alone.

The Unicode standard has established 16 additional "planes" of
characters, each the same size as the BMP. Naturally, most code points
beyond the BMP do not yet have characters assigned to them, but
definition of the planes gives Unicode the potential to define 1,114,112
characters (that is, 2¹⁶ * 17 characters) within the code point range
U+0000 to U+10FFFF. For UTF-16 to represent this larger set of
characters, the Unicode Standard defines "supplementary characters".

A supplementary character is a character located beyond the BMP, and a "surrogate" is a UTF-16 code value. For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using the surrogate mechanism, UTF-16 can support all 1,114,112 potential Unicode characters. For more details about supplementary characters, surrogates, and surrogate pairs, refer to The Unicode Standard.
==================================================
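
The surrogate ranges quoted above translate into a small amount of
arithmetic. The following is a sketch only: toSurrogatePair and
fromSurrogatePair are hypothetical names, not existing DC++ functions,
and std::uint16_t is used as a stand-in for a 16-bit wchar_t.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair, returned as {high, low}.
inline std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit offset past the BMP
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}

// Recombine a high (U+D800..U+DBFF) and low (U+DC00..U+DFFF) surrogate
// into the code point they represent.
inline std::uint32_t fromSurrogatePair(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
                   + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, which is two
wchar_t values on a platform where wchar_t is 16 bits.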

To use this surrogate pair mechanism, Text::utf8ToWc and Text::wcToUtf8,
along with their downstream users (e.g., in dcpp/Util.cpp), would have to
be adapted to allow multiple wchar_t values per codepoint.
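
One possible shape for such an adaptation is sketched below. This is
not a proposed patch: codepointToUtf8, nextCodepoint and utf16ToUtf8 are
hypothetical names, std::u16string stands in for a wstring on a platform
with 16-bit wchar_t so the example behaves the same everywhere, and
mapping unpaired surrogates to U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical counterpart to wcToUtf8 that takes a full code point
// instead of one wchar_t. Assumes cp <= 0x10FFFF and is not a surrogate.
inline void codepointToUtf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: read one code point from UTF-16 input at index i
// (which must be < s.size()), advancing i by one or two code units.
inline char32_t nextCodepoint(const std::u16string& s, std::size_t& i) {
    const char16_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10)
                           + (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    // Unpaired surrogate: substitute U+FFFD REPLACEMENT CHARACTER.
    if (u >= 0xD800 && u <= 0xDFFF) return 0xFFFD;
    return u;
}

// Convert a whole UTF-16 string, allowing two code units per code point.
inline std::string utf16ToUtf8(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        codepointToUtf8(nextCodepoint(s, i), out);
    }
    return out;
}
```

The point of the sketch is that the unit of conversion becomes a code
point (char32_t here) and the reader decides how many 16-bit units to
consume, rather than the caller handing over exactly one wchar_t.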

** Affects: dcplusplus
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of
Dcplusplus-team, which is subscribed to DC++.
https://bugs.launchpad.net/bugs/1715870
