
kicad-developers team mailing list archive

Re: 6.0 string proposal

I think we’re a long way from handling Hiragana, Katakana or Kanji.  Probably the same for Tamil or Telugu, although I know nothing about them.

So why do we scan strings?  We do it when tokenizing, but all our tokens are Roman (if not plain ASCII), so that should be OK.

We also do it when looking for numbers to increment.  We’d like this to work for other languages, but as long as their subsequent code points don’t look like Roman digits I think we’re OK.  (The subsequent code points of a grapheme cluster are all above ASCII, right?)
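
For example, something along these lines (an untested sketch, not actual KiCad code; it works on raw UTF-8 for concreteness) can never misfire, because UTF-8 continuation bytes are all in 0x80..0xBF and so can’t collide with ASCII digits:

    #include <cstddef>
    #include <string>

    // Increment the trailing decimal number in a UTF-8 encoded name,
    // e.g. "R9" -> "R10".  A byte in '0'..'9' can only be a genuine ASCII
    // digit: UTF-8 continuation bytes are all in 0x80..0xBF.
    std::string IncrementTrailingNumber( const std::string& aName )
    {
        std::size_t end = aName.size();
        std::size_t start = end;

        while( start > 0 && aName[start - 1] >= '0' && aName[start - 1] <= '9' )
            --start;

        if( start == end )      // no trailing digits; leave the name alone
            return aName;

        unsigned long num = std::stoul( aName.substr( start ) );
        return aName.substr( 0, start ) + std::to_string( num + 1 );
    }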

We do some case conversions when doing compares.  But again, as long as subsequent code points don’t look like ASCII we should be OK.  I assume capitalization algorithms don’t try to do it on full-width Romaji or other non-ASCII-coded Roman characters?
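
To be concrete, an ASCII-only fold (again an untested sketch) leaves every byte above 0x7F alone, so multi-byte sequences are never corrupted:

    #include <cstddef>
    #include <string>

    // Fold only ASCII upper-case letters; every other byte, including all
    // UTF-8 continuation bytes (0x80..0xBF), passes through untouched.
    static char FoldAscii( char aChar )
    {
        return ( aChar >= 'A' && aChar <= 'Z' ) ? char( aChar - 'A' + 'a' ) : aChar;
    }

    bool EqualsNoCaseAscii( const std::string& aLhs, const std::string& aRhs )
    {
        if( aLhs.size() != aRhs.size() )    // ASCII folding preserves length
            return false;

        for( std::size_t i = 0; i < aLhs.size(); ++i )
        {
            if( FoldAscii( aLhs[i] ) != FoldAscii( aRhs[i] ) )
                return false;
        }

        return true;
    }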

When else do we scan strings?

> On 30 Apr 2019, at 21:35, John Beard <john.j.beard@xxxxxxxxx> wrote:
> 
> On 30/04/2019 18:19, Jeff Young wrote:
>> I was referring to UCS-2 or UCS-4.  I’m evidently behind the times, though, because I now see that UTF-32 and UCS-4 are equivalent.
>> (Which means that some of John’s original premises and my quote in teal below were both wrong: UTF-32 is indeed a one-to-one map between code points and chars.)
> 
> Kind of, depending on the definition of character. As long as you never get any multi-code-point "characters".
> 
>> So my proposal (in 2019) should be std::u32string (using UTF-32 encoding, for which myString[3] still works).
> 
> By "works", what do you mean? Sure you can index into a UTF-32 string and come up with a valid (whole) code point (and a valid code unit). But that doesn't mean a lot: it could be the "ᄀ" (\u1100) from 가, which in decomposed form is actually 2 code points.
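> 
> For instance (untested sketch):
> 
>     #include <cassert>
>     #include <string>
> 
>     int main()
>     {
>         std::u32string precomposed = U"\uAC00";        // 가 as one precomposed code point
>         std::u32string decomposed  = U"\u1100\u1161";  // the same syllable as two jamo
> 
>         assert( precomposed.size() == 1 );
>         assert( decomposed.size() == 2 );
> 
>         // decomposed[0] is a perfectly valid code point (U+1100), but it
>         // is only half of the user-perceived character.
>         assert( decomposed[0] == U'\u1100' );
>     }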
> 
> How often do we actually index into a string buffer by code point anyway, without iterating the string to find something first? What does that even mean in the context of a Unicode string?
> 
> Graphemes are not a strange and ignorable edge case: emojis may sound silly, but lots of actual languages use grapheme clusters perfectly casually (Tamil, Telugu[1], Hangul as above, etc.). You either support Unicode or you don't; you cannot pick and choose what is "reasonable" to support.
> 
> BTW, UTF-8 does allow you to index into it by byte and see if you're on a code point boundary (if the byte starts 0b10xxxxxx, you are not). You can't index straight to the n'th code point without walking the string (but for what purpose?) and you still can't index to the n'th grapheme, but you can't do that in *any* encoding.
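> 
> The boundary test is a one-liner, and counting code points is then a single pass with no decoding (untested sketch):
> 
>     #include <cstddef>
>     #include <string>
> 
>     // UTF-8 continuation bytes always match 0b10xxxxxx.
>     inline bool IsCodePointBoundary( unsigned char aByte )
>     {
>         return ( aByte & 0xC0 ) != 0x80;
>     }
> 
>     // Count code points by counting boundary bytes; O(n), no decode step.
>     std::size_t CodePointCount( const std::string& aUtf8 )
>     {
>         std::size_t count = 0;
> 
>         for( unsigned char byte : aUtf8 )
>         {
>             if( IsCodePointBoundary( byte ) )
>                 ++count;
>         }
> 
>         return count;
>     }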
> 
>> Better?
> 
> As long as we save our files as UTF-8, I don't really mind what we use internally. But if you actually plan to manipulate strings that could be Unicode and come from a user, you cannot do it only by code point, regardless of representation.
> 
> Cheers,
> 
> John
> 
> [1]: Mishandling of Telugu produced the iPhone SMS of Death bug.


