linuxdcpp-team team mailing list archive
Message #08813
[Bug 1715870] [NEW] Text::utf8ToWc and Text::wcToUtf8 interfaces/declarations incorrectly assume wchar_t on Win32 can represent any Unicode codepoint
Public bug reported:
int utf8ToWc(const char* str, wchar_t& c);
void wcToUtf8(wchar_t c, string& str);
Both assume that every relevant Unicode codepoint can be represented as
one wchar_t. This is not the case.
On at least some Win32 platforms, sizeof(wchar_t) == 2 (16 bits), as documented by
https://msdn.microsoft.com/en-us/library/gg269344%28v=exchg.10%29.aspx and
https://msdn.microsoft.com/en-us/library/windows/desktop/aa367308(v=vs.85).aspx
among other MSDN pages. That is not enough for every codepoint: the table at
https://apps.timwhitlock.info/emoji/tables/unicode lists many emoji whose
codepoints need more than 16 bits.
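To make the size problem concrete, here is a minimal sketch of what happens when a codepoint is forced into one 16-bit unit. The helper name storeInOneUnit is hypothetical and not part of DC++; char16_t stands in for a 2-byte wchar_t so the result does not depend on the platform's wchar_t width.

```cpp
// Hypothetical helper for illustration only, not part of DC++.
// Models storing a codepoint in a single 16-bit unit, as a 2-byte wchar_t would.
constexpr char16_t storeInOneUnit(char32_t cp) {
    return static_cast<char16_t>(cp);  // keeps only the low 16 bits
}

// U+00E9 lies inside the BMP and survives unchanged.
static_assert(storeInOneUnit(U'\u00E9') == u'\u00E9',
              "BMP codepoint fits in one unit");

// U+1F600 (an emoji) loses its upper bits and becomes U+F600,
// an unrelated private-use codepoint.
static_assert(storeInOneUnit(U'\U0001F600') == 0xF600,
              "supplementary codepoint is truncated");
```

Any interface that returns or accepts exactly one such unit per codepoint is exposed to this kind of silent truncation.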
https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx describes how UTF-16 represents such codepoints:
==================================================
Windows applications normally use UTF-16 to represent Unicode character data. The use of 16 bits allows direct representation of 65,536 unique characters, but this Basic Multilingual Plane (BMP) is not nearly enough to cover all the symbols used in human languages. Unicode version 4.1 includes over 97,000 characters, with over 70,000 characters for Chinese alone.
The Unicode standard has established 16 additional "planes" of
characters, each the same size as the BMP. Naturally, most code points
beyond the BMP do not yet have characters assigned to them, but
definition of the planes gives Unicode the potential to define 1,114,112
characters (that is, 2¹⁶ * 17 characters) within the code point range
U+0000 to U+10FFFF. For UTF-16 to represent this larger set of
characters, the Unicode Standard defines "supplementary characters".
A supplementary character is a character located beyond the BMP, and a "surrogate" is a UTF-16 code value. For UTF-16, a "surrogate pair" is required to represent a single supplementary character. The first (high) surrogate is a 16-bit code value in the range U+D800 to U+DBFF. The second (low) surrogate is a 16-bit code value in the range U+DC00 to U+DFFF. Using the surrogate mechanism, UTF-16 can support all 1,114,112 potential Unicode characters. For more details about supplementary characters, surrogates, and surrogate pairs, refer to The Unicode Standard.
==================================================
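The surrogate arithmetic described in the quoted text can be sketched as follows. The function names toSurrogatePair and fromSurrogatePair are hypothetical, not existing DC++ functions; the constants are the surrogate ranges given above.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper, not DC++ API: split a supplementary codepoint
// (U+10000..U+10FFFF) into a (high, low) UTF-16 surrogate pair.
std::pair<char16_t, char16_t> toSurrogatePair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000;  // 20-bit offset past the BMP
    return { static_cast<char16_t>(0xD800 + (v >> 10)),      // top 10 bits
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) };  // low 10 bits
}

// Hypothetical helper, not DC++ API: recombine a surrogate pair.
char32_t fromSurrogatePair(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, U+1F600 has offset 0xF600 past the BMP, which splits into the pair 0xD83D, 0xDE00.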
To use this surrogate pair mechanism, Text::utf8ToWc and Text::wcToUtf8,
along with downstream users (e.g., in dcpp/Util.cpp), would have to be
adapted to allow multiple wchar_t values per codepoint.
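One possible shape for such an adapted decoder is sketched below. This is an assumption about how the interface could look, not a proposed patch: the name utf8ToUtf16Append, the length parameter, the return convention (bytes consumed, 0 on malformed input) and the use of std::u16string instead of a wchar_t string are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical replacement for the single-wchar_t output parameter.
// Decodes one UTF-8 sequence starting at str (at most len bytes) and appends
// one or two UTF-16 code units to out.
// Returns the number of bytes consumed, or 0 on malformed input.
std::size_t utf8ToUtf16Append(const char* str, std::size_t len, std::u16string& out) {
    assert(str != nullptr);
    if (len == 0) return 0;
    const auto b0 = static_cast<unsigned char>(str[0]);
    std::size_t n;  // total sequence length in bytes
    char32_t cp;    // codepoint accumulated so far
    if (b0 < 0x80)                { n = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; }
    else return 0;                // stray continuation byte or invalid lead byte
    if (n > len) return 0;        // truncated sequence
    for (std::size_t i = 1; i < n; ++i) {
        const auto b = static_cast<unsigned char>(str[i]);
        if ((b & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogate codepoints, and values past U+10FFFF.
    static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < minForLen[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: one code unit
    } else {
        const char32_t v = cp - 0x10000;           // supplementary: surrogate pair
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return n;
}
```

Appending to a string rather than writing through a wchar_t& is what lets a single call produce two code units. The encoding direction (wcToUtf8) would need the mirror-image change, for instance by accepting a full codepoint or a pair of units instead of one wchar_t.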
** Affects: dcplusplus
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/1715870