kicad-developers team mailing list archive

Re: 6.0 string proposal

To: Jeff Young <jeff@xxxxxxxxx>, Seth Hillbrand <seth@xxxxxxxxxxxxx>
From: John Beard <john.j.beard@xxxxxxxxx>
Date: Tue, 30 Apr 2019 21:35:55 +0100
Cc: kicad-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CF49F4EF-7B76-430C-A112-974332354C78@rokeby.ie>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

On 30/04/2019 18:19, Jeff Young wrote:

> I was referring to UCS-2 or UCS-4.  I’m evidently behind the times, though, because I now see that UTF-32 and UCS-4 are equivalent.
>
> (Which means that both some of John’s original premises and my quote in teal below were wrong: UTF32 is indeed a one:one map between code points and chars.)

Kind of, depending on the definition of character. As long as you never get any multi-code point "characters".

> So my proposal (in 2019) should be std::u32string (using UTF32 encoding, for which myString[3] still works).

By "works", what do you mean? Sure, you can index into a UTF-32 string and come up with a valid (whole) code point (and a valid code unit). But that doesn't mean a lot: it could be the "ᄀ" (\u1100) from 가, which in its decomposed form (\u1100 \u1161) is actually 2 code points.
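To make that concrete, here is a minimal sketch of the indexing problem. The helper names decomposed_ga and code_point_at are made up for illustration and are not KiCad code; the syllable is written with explicit escapes so that the result does not depend on how the source file happens to be normalized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: the Hangul syllable 가 in decomposed form,
// U+1100 (initial consonant) followed by U+1161 (vowel).
inline std::u32string decomposed_ga()
{
    return U"\u1100\u1161";
}

// Hypothetical helper: plain indexing into UTF-32. It always yields a
// whole code point, but that code point may be only a fragment of what
// the user perceives as a single character.
inline char32_t code_point_at(const std::u32string& s, std::size_t i)
{
    assert(i < s.size());
    return s[i];
}
```

With these, decomposed_ga().size() is 2 and code_point_at(decomposed_ga(), 0) is U+1100, the bare consonant, even though a user would say the string holds one character.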

How often do we actually index into a string buffer by code point anyway, without iterating the string to find something first? What does that even mean in the context of a Unicode string?

Graphemes are not a strange and ignorable edge case: emojis may sound silly, but lots of actual languages use grapheme clusters perfectly casually (Tamil, Telugu[1], Hangul as above, etc). You either support Unicode or you don't; you cannot pick and choose what is "reasonable" to support.

BTW, UTF-8 does allow you to index into it by byte and see if you're on a code point boundary (if the byte starts 0b10xxxxxx, you are not). You can't index directly to the n'th code point (but for what purpose?) and you still can't index to the n'th grapheme, but you can't do that in *any* encoding.
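A rough sketch of that boundary test, assuming the input is already valid UTF-8 held in a std::string. The names is_code_point_boundary and count_code_points are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if byte i starts a code point, i.e. it is
// not a continuation byte of the form 0b10xxxxxx.
inline bool is_code_point_boundary(const std::string& utf8, std::size_t i)
{
    assert(i < utf8.size());
    return (static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80;
}

// Hypothetical helper: counting code points needs a linear scan over
// the bytes, which is why there is no direct index to the n'th one.
// Assumes valid UTF-8; malformed input is not detected.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if (is_code_point_boundary(utf8, i))
            ++n;
    }
    return n;
}
```

For the precomposed 가 (bytes EA B0 80) only byte 0 is a boundary and the count is 1. Neither function says anything about grapheme boundaries; those need the Unicode segmentation rules, whatever the encoding.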

> Better?

As long as we save our files as UTF-8, I don't really mind what we use internally. But if you actually plan to manipulate strings that could be Unicode and come from a user, you cannot do it only by code point, regardless of representation.


Cheers,

John

[1]: Mishandling of Telugu produced the iPhone SMS of Death bug.
