kicad-developers team mailing list archive
Message #40371
Re: 6.0 string proposal
On 30/04/2019 18:19, Jeff Young wrote:
> I was referring to UCS-2 or UCS-4. I’m evidently behind the times, though, because I now see that UTF-32 and UCS-4 are equivalent.
> (Which means that both some of John’s original premises and my quote in teal below were wrong: UTF32 is indeed a one:one map between code points and chars.)
Kind of, depending on the definition of character. As long as you never
get any multi-code point "characters".
> So my proposal (in 2019) should be std::u32string (using UTF32 encoding, for which myString[3] still works).
By "works", what do you mean? Sure you can index into a UTF-32 string
and come up with a valid (whole) code point (and a valid code unit). But
that doesn't mean a lot: it could be the "ᄀ" (\u1100) from 가, which in
its decomposed form is actually 2 code points (\u1100 \u1161).
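To make that concrete, here is a minimal C++ sketch (not KiCad code; the helper names decomposed_ga, precomposed_ga and code_point_at are made up for illustration). It assumes a C++11 or later compiler. The same user-visible syllable is stored once as two code points and once as one, and indexing the first form returns only the leading jamo:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decomposed spelling of the Hangul syllable 가:
// U+1100 (initial consonant) followed by U+1161 (vowel).
inline std::u32string decomposed_ga() {
    return U"\u1100\u1161";
}

// Precomposed spelling of the same syllable: the single code point U+AC00.
inline std::u32string precomposed_ga() {
    return U"\uAC00";
}

// Hypothetical helper: "the character at index i" in the naive sense.
// It returns a whole code point, which is not necessarily a whole
// user-perceived character.
inline char32_t code_point_at(const std::u32string& s, std::size_t i) {
    return s.at(i);  // throws std::out_of_range on a bad index
}

// Small self-check showing the mismatch between code points and
// what the user sees as one character.
inline void check_ga() {
    assert(decomposed_ga().size() == 2);   // one visible syllable, two code points
    assert(precomposed_ga().size() == 1);  // same syllable, one code point
    assert(code_point_at(decomposed_ga(), 0) == U'\u1100');  // only the jamo
}
```

Both strings render as 가, yet they differ in length and compare unequal unless they are normalized first, which std::u32string does not do for you.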
How often do we actually index into a string buffer by code point
anyway, without iterating the string to find something first? What does
that even mean in the context of a Unicode string?
Graphemes are not a strange and ignorable edge case: emojis may sound
silly, but lots of actual languages use grapheme clusters perfectly
casually (Tamil, Telugu[1], Hangul as above, etc). You either support
Unicode or you don't, you cannot pick and choose what is "reasonable" to
support.
BTW, UTF-8 does allow you to index into it by byte and see if you're
on a code point boundary (if the byte starts 0b10xxxxxx, you are not).
You can't index to the n'th code point (but for what purpose?) and you
still can't index to the n'th grapheme, but you can't do that in *any*
encoding.
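A rough sketch of that boundary check in C++, assuming the input is already well-formed UTF-8 held in a std::string (the function names are hypothetical, not from any KiCad or wxWidgets API):

```cpp
#include <cstddef>
#include <string>

// True if the byte at index i starts a code point, i.e. it is not a
// continuation byte of the form 0b10xxxxxx.
inline bool is_code_point_boundary(const std::string& utf8, std::size_t i) {
    return (static_cast<unsigned char>(utf8.at(i)) & 0xC0) != 0x80;
}

// Step back from an arbitrary byte index to the first byte of the code
// point that contains it.
inline std::size_t code_point_start(const std::string& utf8, std::size_t i) {
    while (i > 0 && !is_code_point_boundary(utf8, i)) {
        --i;
    }
    return i;
}

// Counting code points needs a full pass over the bytes: count every
// byte that is not a continuation byte.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (is_code_point_boundary(utf8, i)) {
            ++n;
        }
    }
    return n;
}
```

For the four-byte string "a" followed by 가 encoded as EA B0 80, byte indices 0 and 1 are boundaries and 2 and 3 are not. Finding the n'th code point still means walking from the start, and none of this says anything about grapheme boundaries.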
> Better?
As long as we save our files as UTF-8, I don't really mind what we use
internally. But if you actually plan to manipulate strings that could be
Unicode and it comes from a user, you cannot do it only by code point,
regardless of representation.
Cheers,
John
[1]: Mishandling of Telugu produced the iPhone SMS of Death bug.