kicad-developers team mailing list archive

Re: 6.0 string proposal

To: Jeff Young <jeff@xxxxxxxxx>
From: John Beard <john.j.beard@xxxxxxxxx>
Date: Tue, 30 Apr 2019 23:33:59 +0100
Cc: kicad-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <C1368350-7378-4267-97F2-9E654BABFEB4@rokeby.ie>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

On 30/04/2019 21:55, Jeff Young wrote:

I think we’re a long way from handling Hiragana, Katakana or Kanji.  Probably the same for Tamil or Telugu, although I know nothing about them.

We can already accept Unicode from many string user inputs. For example, a net name can be "🇺🇸" (this is 2 code points, but might render as a flag or a "US"). It doesn't render on the schematic (you see ??) but it does make it into the netlist just fine, at least on Linux with a UTF-8 locale.
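To make the "2 code points" remark concrete, here is a minimal Python sketch (Python is used here only for illustration; it is not KiCad code). It shows that the flag is two regional indicator code points, which a renderer may or may not combine into one glyph.

```python
import unicodedata

# The flag is written with escapes so the two code points are explicit.
flag = "\U0001F1FA\U0001F1F8"

# Python's str is a sequence of code points, so len() counts code points.
code_point_count = len(flag)

# In UTF-8 each of these code points takes 4 bytes (code units).
utf8 = flag.encode("utf-8")

# Whether this shows up as a flag or as "US" is entirely up to the renderer.
names = [unicodedata.name(c) for c in flag]

print(code_point_count, len(utf8), names)
```

The storage side only needs to carry those 8 bytes around unchanged; deciding what they look like is somebody else's job.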

If we get user fonts in schematics (e.g. to support Chinese users without making our own 10000+ char Hershey font[1]), this will Just Work(TM). This is on the Road Map V6[2].

Anywhere text comes in from the user, it could be any kind of weird Unicode. Generally 95% of the Unicode handling is taking data from the user, keeping it safe, and then passing it on to something that can deal with the hellish mess of glyphs, like HarfBuzz[3] (used by toolkits like GTK+ and, directly, by Inkscape). As far as we are concerned, it is opaque data.

So why do we scan strings?  We do it when tokenizing, but all our tokens are roman (if not ascii), so that should be OK.

In this case we'd be iterating over the strings anyway, not doing random access. Parsing s-expressions should be as well defined in Unicode as in ASCII; it's a matter of what the grammar accepts. There is no formal grammar, but if there were one, it would probably say ASCII-only symbols and anything goes between the quotes of a string, and we'd check for valid UTF-8 at some point, though maybe not in the tokenising stage.
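As a rough illustration of that hypothetical grammar, here is a toy tokenizer in Python. It is only a sketch under stated assumptions: the function name tokenize is made up, there is no escape handling inside strings, and it is not how the KiCad s-expression parser works. Symbols must be ASCII; whatever sits between quotes is treated as opaque and only checked for UTF-8 validity when decoded.

```python
def tokenize(data: bytes):
    """Tokenize a tiny s-expression subset directly from UTF-8 bytes."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in b" \t\r\n":
            i += 1                                   # skip whitespace
        elif c in b"()":
            tokens.append(c.decode("ascii"))
            i += 1
        elif c == b'"':
            # Assumption: no escaped quotes. 0x22 cannot occur inside a
            # multi-byte UTF-8 sequence, so a plain byte search is safe.
            end = data.index(b'"', i + 1)
            # Strict decode is the "check for valid UTF-8 at some point".
            tokens.append(("str", data[i + 1:end].decode("utf-8")))
            i = end + 1
        else:
            j = i
            while j < n and data[j:j + 1] not in b' \t\r\n()"':
                j += 1
            # Raises UnicodeDecodeError if a symbol is not pure ASCII.
            tokens.append(data[i:j].decode("ascii"))
            i = j
    return tokens


example = tokenize('(net "\U0001F1FA\U0001F1F8")'.encode("utf-8"))
print(example)
```

The scan is strictly front to back; nothing here indexes into the middle of a string without having walked there first.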

We also do it looking for numbers to increment.  We’d like this to work for other languages, but as long as their subsequent-code-points don’t look like roman digits I think we’re OK.  (Subsequent-code-points are all above ascii, right?).

Subsequent code *units* in a code point (in UTF-8, where a unit is 1 byte and a point is 1-4 bytes) start with 0b10, so they are always above ASCII. There's no rule about what order code *points* can come in, but some orders make sense, and other orders are nonsense to humans[4].
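The 0b10 rule can be checked with a few lines of Python. This is a minimal sketch with made-up helper names; it counts code points by skipping continuation bytes and confirms that no byte of a multi-byte sequence falls in the ASCII range, so nothing in there can be mistaken for a digit.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which all look like 0b10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the non-continuation bytes."""
    return sum(1 for b in data if not is_continuation(b))


text = "R1 抵抗 ß"
data = text.encode("utf-8")

# Lead and continuation bytes of multi-byte sequences are all >= 0x80,
# so the only bytes below 0x80 are genuine ASCII characters.
ascii_bytes = bytes(b for b in data if b < 0x80)

print(len(data), count_code_points(data), ascii_bytes)
```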

Parsing some arbitrary sequence of code points and getting something semantically useful out is "hard", and highly domain-specific (like, is 0xAB a number? I would say so, but my non-computery friends will say no. What about 一百零五?). But it would still be done on an iterative basis.
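For the "number to increment" case, one possible policy is to only ever treat trailing ASCII digits as a number and leave everything else alone. The Python sketch below assumes that policy; the function name is hypothetical and this is not the existing KiCad logic.

```python
import re

# Only ASCII 0-9 count. In Python, \d on str would also match other
# Unicode decimal digits, which is a different policy decision.
_TRAILING_ASCII_DIGITS = re.compile(r"[0-9]+\Z")


def increment_trailing_number(name: str) -> str:
    """Increment a trailing run of ASCII digits, keeping its zero padding."""
    m = _TRAILING_ASCII_DIGITS.search(name)
    if m is None:
        return name                      # nothing we recognise as a number
    digits = m.group()
    bumped = str(int(digits) + 1).zfill(len(digits))
    return name[:m.start()] + bumped


print(increment_trailing_number("R9"))        # ASCII digits after ASCII text
print(increment_trailing_number("抵抗07"))     # ASCII digits after non-ASCII text
print(increment_trailing_number("D一百零五"))   # left untouched under this policy
```

Whether something like "一百零五" or an Arabic-Indic digit should count is exactly the domain-specific decision mentioned above; Python's own int() accepts the latter (int("٣") is 3), which shows that "what is a digit" is a choice rather than a given.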

We do some case conversions when doing compares.  But again, as long as subsequent code points don’t look like ascii we should be OK.  I assume capitalization algorithms don’t try to do it on Romaji or other non-ascii-coded roman characters?

They do (e.g. Greek, Cyrillic, and there are many other bicameral scripts) and there are sometimes special rules. A common example is ß->SS, so you can't even be sure of the length! This is locale-dependent (e.g. std::toupper listens to the std::locale). This, again, is a "hard" problem, and there are libraries for it, e.g. ICU, if you really need to get it right (e.g. normalizing Unicode sequences to NFC first, etc).
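The ß example is easy to reproduce in Python. Note the assumption: Python's str.upper() applies the default Unicode case mappings and does not consult the locale, so this sketch shows the length change and NFC normalization but not the locale-dependent behaviour of std::toupper or ICU.

```python
import unicodedata

word = "straße"
upper = word.upper()          # full case mapping: ß becomes SS

# Non-ASCII bicameral scripts are case-mapped too.
greek_upper = "σ".upper()
cyrillic_upper = "д".upper()

# For caseless comparison, casefold() is the usual tool in Python.
same_caseless = "STRASSE".casefold() == word.casefold()

# "e" + combining acute vs. precomposed "é": equal only after normalizing.
decomposed = "e\u0301"
precomposed = "\u00e9"
equal_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(len(word), len(upper), upper, greek_upper, cyrillic_upper)
print(same_caseless, decomposed == precomposed, equal_after_nfc)
```

Code that allocates an output buffer the same size as the input before upper-casing is the sort of thing this breaks.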


In any case, the capitalization algorithm is an iteration over the string.

When else do we scan strings?

The question is when do we randomly index into strings without having scanned for the index point beforehand. This is actually not a common action when you're dealing with arbitrary user input. You will normally be using some kind of iterative process like "find the offset of the first colon" or "split on the second space" or "uppercase this string" or "replace illegal characters" or something.
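A small Python sketch of the difference between scanning for an index and indexing blindly, working on raw UTF-8 bytes. The helper names are made up for illustration.

```python
def split_at_first_colon(data: bytes):
    """Split UTF-8 bytes at the first ASCII colon, found by scanning."""
    # 0x3A never occurs inside a multi-byte sequence, so the offset
    # returned by find() always falls on a code point boundary.
    idx = data.find(b":")
    if idx < 0:
        return data, b""
    return data[:idx], data[idx + 1:]


def prefix_is_valid_utf8(data: bytes, end: int) -> bool:
    """Check whether cutting at an arbitrary byte offset keeps the prefix valid."""
    try:
        data[:end].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "名前:値".encode("utf-8")
head, tail = split_at_first_colon(raw)

print(head.decode("utf-8"), tail.decode("utf-8"))
print(prefix_is_valid_utf8(raw, 4))   # offset picked without scanning
print(prefix_is_valid_utf8(raw, 6))   # offset of the colon, found by scanning
```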

Things like string sorting will also "just work" in UTF-8. It's designed that way so that lexicographically sorting by byte is the same as lexicographically sorting by code point[5].
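The sorting property can be checked directly in Python, where str comparison is by code point and bytes comparison is by byte value. The UTF-16 comparison is added only as a contrast; it is not something discussed above.

```python
words = ["zebra", "Ärger", "apple", "日本", "\U0001F1FA\U0001F1F8", "öl", "o"]

by_code_point = sorted(words)                                  # str compares by code point
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Contrast: UTF-16 code units do not preserve code point order once
# surrogate pairs are involved (U+FF21 vs. a character above U+FFFF).
pair = ["\uff21", "\U0001F1FA\U0001F1F8"]
pair_by_code_point = sorted(pair)
pair_by_utf16_units = sorted(pair, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)
print(pair_by_code_point == pair_by_utf16_units)
```

Neither order is what a human would call alphabetical, of course; it is just stable and cheap.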

If you're dealing with known or expected text, you can certainly still index into a UTF-8/32 string. But never for text that's come from some Unicode source. It could be anything, even just 50000 zero-width joiners in a row and that silly poo emoji at the end. That is a problem for HarfBuzz.


Yes, it's extremely annoying, but human language is a very complex thing.

Cheers,

John

[1]: https://bugs.launchpad.net/kicad/+bug/594064 (though I think a Hershey Chinese font would be "fun", I don't see it happening soon).

[2]: http://docs.kicad-pcb.org/doxygen/v6_road_map.html#v6_sch_sys_fonts
[3]: https://en.wikipedia.org/wiki/HarfBuzz

[4]: The iPhone SMS of Death was caused by a "nonsense" Unicode code point sequence.

[5]: And if you want "real" sorting, well, that's *also* locale dependent: in German DIN 5007-1, ö=o, 5007-2, ö=oe, in Swedish, ö is at the end, after ä.
