kicad-developers team mailing list archive

Re: 6.0 string proposal

 

On 30/04/2019 18:19, Jeff Young wrote:
I was referring to UCS-2 or UCS-4.  I’m evidently behind the times, though, because I now see that UTF-32 and UCS-4 are equivalent.

(Which means that both some of John’s original premises and my quote in teal below were wrong: UTF32 is indeed a one:one map between code points and chars.)

Kind of, depending on the definition of "character". It only holds as long as you never get any multi-code-point "characters".

So my proposal (in 2019) should be std::u32string (using UTF32 encoding, for which myString[3] still works).

By "works", what do you mean? Sure you can index into a UTF-32 string and come up with a valid (whole) code point (and a valid code unit). But that doesn't mean a lot: it could be the "ᄀ" (\u1100) from 가, which is actually 2 code points.
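To make that concrete, here is a minimal sketch, assuming 가 is stored in its decomposed form (the two conjoining jamo U+1100 and U+1161). The helper names are invented for illustration and are not KiCad code.

```cpp
#include <string>

// "가" written as two conjoining jamo: U+1100 (ᄀ) followed by U+1161 (ᅡ).
// Escapes are used so the example does not depend on how the source file
// itself happens to be normalized.
inline std::u32string decomposed_ga()
{
    return U"\u1100\u1161";
}

// The same syllable as the single precomposed code point U+AC00.
inline std::u32string precomposed_ga()
{
    return U"\uAC00";
}
```

Both strings render as the same user-visible character, yet decomposed_ga().size() is 2 and decomposed_ga()[0] is only the leading consonant, while precomposed_ga().size() is 1. The index operator hands back a whole code point either way, but not necessarily a whole "character".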

How often do we actually index into a string buffer by code point anyway, without iterating the string to find something first? What does that even mean in the context of a Unicode string?

Grapheme clusters are not a strange and ignorable edge case: emojis may sound silly, but lots of actual languages and scripts use grapheme clusters perfectly casually (Tamil, Telugu[1], Hangul as above, etc). You either support Unicode or you don't; you cannot pick and choose what is "reasonable" to support.

BTW, UTF-8 does allow you to index into it by byte and see if you're on a code point boundary (if the byte has the form 0b10xxxxxx, it is a continuation byte and you are not). You can't index directly to the n'th code point (but for what purpose?) and you still can't index to the n'th grapheme, but you can't do that in *any* encoding.
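The boundary check described above can be sketched as follows. This is only an illustration: the function names are made up, and it assumes the std::string already holds valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// True when `pos` is a code point boundary in `utf8`: either the end of the
// string, or a byte that is NOT a continuation byte (0b10xxxxxx).
inline bool is_code_point_boundary(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
        return pos == utf8.size();

    return (static_cast<unsigned char>(utf8[pos]) & 0xC0) != 0x80;
}

// Step backwards from `pos` to the first byte of the code point containing it.
inline std::size_t code_point_start(const std::string& utf8, std::size_t pos)
{
    while (pos > 0 && !is_code_point_boundary(utf8, pos))
        --pos;

    return pos;
}
```

For the bytes 'a', 0xEA, 0xB0, 0x80, 'b' (that is, "a가b" with 가 as U+AC00), offsets 0, 1 and 4 are boundaries and offsets 2 and 3 are not; code_point_start at offset 3 returns 1. Note that this finds code point boundaries only, and says nothing about where a grapheme cluster begins or ends.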

Better?

As long as we save our files as UTF-8, I don't really mind what we use internally. But if you actually plan to manipulate strings that come from a user and could contain arbitrary Unicode, you cannot do it only by code point, regardless of representation.

Cheers,

John

[1]: Mishandling of Telugu produced the iPhone SMS of Death bug.

