
kicad-developers team mailing list archive

Re: 6.0 string proposal

 

On 30/04/2019 21:55, Jeff Young wrote:
I think we’re a long way from handling Hiragana, Katakana or Kanji.  Probably the same for Tamil or Telugu, although I know nothing about them.

We can already accept Unicode from many string user inputs. For example a net name can be "🇺🇸" (this is 2 code points, but might render as a flag or a "US"). It doesn't render on the schematic (you see ??) but it does make it into the netlist just fine. At least on Linux with a UTF-8 locale.

If we get user fonts in schematics (e.g. to support Chinese users without making our own 10000+ char Hershey font[1]), rendering such text will Just Work (TM). This is on the V6 Road Map[2].

Anywhere text comes in from the user, it could be any kind of weird Unicode. Generally, 95% of Unicode handling is taking data from the user, keeping it safe, and then passing it on to something that can deal with the hellish mess of glyphs, like HarfBuzz[3] (used by toolkits like GTK+ and, directly, by Inkscape). As far as we are concerned, it is opaque data.

So why do we scan strings?  We do it when tokenizing, but all our tokens are roman (if not ascii), so that should be OK.

In this case we'd be iterating over the strings anyway, not accessing them randomly. Parsing s-expressions should be as well defined in Unicode as in ASCII; it's a matter of what the grammar accepts. There is no formal grammar, but if there were one, it would probably say that symbols are ASCII only and that anything goes between the quotes of a string. We'd check for valid UTF-8 at some point, but maybe not in the tokenising stage.
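For illustration, a UTF-8 validity check can itself be a single forward pass. This is only a sketch: is_valid_utf8 is a made-up name, not an existing KiCad function, and a real implementation might well delegate to a library instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong
// encodings, UTF-16 surrogates and code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        char32_t smallest;  // smallest code point that needs this length
        if (b0 < 0x80)                { len = 1; cp = b0;        smallest = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; smallest = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; smallest = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; smallest = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (s.size() - i < len)
            return false;  // sequence is cut off at the end of the string

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80)
                return false;  // expected a 10xxxxxx continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < smallest) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

Note that this never indexes anywhere it has not already scanned up to, so it fits the "iterate, don't random-access" pattern.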

We also do it looking for numbers to increment.  We’d like this to work for other languages, but as long as their subsequent-code-points don’t look like roman digits I think we’re OK.  (Subsequent-code-points are all above ascii, right?).

Subsequent code *units* in a code point (in UTF-8, where a unit is 1 byte and a point is 1-4 bytes) start with 0b10. There's no rule about what order code *points* can come in, but some orders make sense, other orders are nonsense to humans[4].
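As a rough sketch of what that bit pattern buys us (the function names here are invented for the example, and the input is assumed to be valid UTF-8 already):

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Count code points by counting every byte that is NOT a continuation
// byte. Assumes the input is valid UTF-8; no validation is done here.
inline std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if (!is_continuation_byte(b))
            ++n;
    }
    return n;
}
```

Every byte of a multi-byte sequence has its top bit set, so none of them can be mistaken for an ASCII digit, letter or delimiter. The flag from earlier is 8 bytes but count_code_points reports 2 for it.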

Parsing some arbitrary sequence of code points and getting something semantically useful out is "hard", and highly domain-specific (like, is 0xAB a number? I would say so, but my non-computery friends will say no. What about 一百零五?). But it would still be done on an iterative basis.
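If we restrict ourselves to the easy case, ASCII digits at the end of a string, a byte-wise scan is safe on UTF-8 input. A minimal sketch, assuming that "number" means a trailing run of '0'-'9' only (increment_trailing_number is a hypothetical name, not what KiCad actually calls this):

```cpp
#include <cstddef>
#include <string>

inline bool is_ascii_digit(char c)
{
    // Bytes of multi-byte UTF-8 sequences are >= 0x80, so they never match.
    return c >= '0' && c <= '9';
}

// Increment the trailing run of ASCII digits, e.g. "R9" -> "R10".
// Returns the input unchanged if it does not end in an ASCII digit.
// Leading zeros are preserved ("R007" -> "R008").
inline std::string increment_trailing_number(const std::string& s)
{
    std::size_t start = s.size();
    while (start > 0 && is_ascii_digit(s[start - 1]))
        --start;
    if (start == s.size())
        return s;  // no trailing digits

    std::string digits = s.substr(start);
    // Add one with a manual carry so long digit runs cannot overflow an int.
    std::size_t i = digits.size();
    while (i > 0) {
        --i;
        if (digits[i] == '9') {
            digits[i] = '0';  // carry into the next digit to the left
        } else {
            ++digits[i];
            return s.substr(0, start) + digits;
        }
    }
    // Every digit was 9, so the number gains a leading 1.
    return s.substr(0, start) + "1" + digits;
}
```

The prefix is treated as opaque bytes and copied through untouched. Something like 一百零五 is simply not recognised as a number by this sketch.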

We do some case conversions when doing compares.  But again, as long as subsequent code points don’t look like ascii we should be OK.  I assume capitalization algorithms don’t try to do it on Romaji or other non-ascii-coded roman characters?

They do (e.g. Greek, Cyrillic and there are many other bicameral scripts) and there are sometimes special rules. A common example is ß->SS, so you can't even be sure of the length! This is locale-dependent (e.g. std::toupper listens to the std::locale). This, again, is a "hard" problem, and there are libraries for it, e.g. ICU, if you really need to get it right (e.g. normalizing Unicode sequences to NFC first, etc).

In any case, the capitalization algorithm is an iteration over the string.
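To make the limits concrete, here is a sketch of a compare that folds ASCII letters only. It is byte-safe on UTF-8, but it is explicitly not Unicode case folding, and the function names are made up for the example:

```cpp
#include <cstddef>
#include <string>

// Lowercase ASCII 'A'-'Z' only; every other byte is returned unchanged,
// including all bytes of multi-byte UTF-8 sequences.
inline char ascii_to_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compare two UTF-8 strings, ignoring case for ASCII letters only.
// This does NOT implement Unicode case folding: "ß" will not match "SS",
// and Greek or Cyrillic letters are compared exactly.
inline bool equal_ignoring_ascii_case(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_to_lower(a[i]) != ascii_to_lower(b[i]))
            return false;
    }
    return true;
}
```

With this, "VCC" equals "vcc", but "Straße" does not equal "STRASSE" and "Ω" does not equal "ω". Getting those right is where something like ICU comes in.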

When else do we scan strings?

The question is when do we randomly index into strings without having scanned for the index point beforehand. This is actually not a common action when you're dealing with arbitrary user input. You will normally be using some kind of iterative process like "find the offset of the first colon" or "split on the second space" or "uppercase this string" or "replace illegal characters" or something.
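For example, "split at the first colon" might look like the sketch below (split_at_first is a hypothetical helper). The offset comes out of a scan, so it can never land in the middle of a multi-byte sequence as long as the delimiter is ASCII:

```cpp
#include <string>
#include <utility>

// Split s at the first occurrence of an ASCII delimiter.
// Returns {before, after}; if the delimiter is absent, returns {s, ""}.
// Safe on UTF-8 because an ASCII byte never occurs inside a multi-byte
// sequence, so the offset found is always on a code point boundary.
inline std::pair<std::string, std::string>
split_at_first(const std::string& s, char delim)
{
    const std::string::size_type pos = s.find(delim);
    if (pos == std::string::npos)
        return { s, std::string() };
    return { s.substr(0, pos), s.substr(pos + 1) };
}
```

Splitting "Ω:10k" at ':' gives "Ω" and "10k" without the code ever needing to know how many bytes the Ω takes.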

Things like string sorting will also "just work" in UTF-8. It's designed that way so that lexicographically sorting by byte is the same as lexicographically sorting by code point[5].
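A quick way to convince yourself of this is to decode to code points and compare both orderings. The decoder below is a sketch that assumes valid UTF-8 input and does no validation; both function names are invented for the example:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into code points. Assumes the input is valid UTF-8.
inline std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;  // 1-byte (ASCII) case
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        // Append the low 6 bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// True if comparing the raw bytes gives the same answer as comparing the
// decoded code points. std::string compares its chars as unsigned values.
inline bool byte_order_matches_code_point_order(const std::string& a,
                                                const std::string& b)
{
    const bool bytes_less = a < b;
    const bool points_less = decode_utf8(a) < decode_utf8(b);
    return bytes_less == points_less;
}
```

This is code point order only. It says nothing about what a German or Swedish user would consider correctly sorted.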

If you're dealing with known or expected text, you can certainly still index into a UTF-8/32 string. But never for text that's come from some Unicode source. It could be anything, even just 50000 zero-width joiners in a row and that silly poo emoji at the end. That is a problem for HarfBuzz.

Yes, it's extremely annoying, but human language is a very complex thing.

Cheers,

John

[1]: https://bugs.launchpad.net/kicad/+bug/594064 (though I think a Hershey Chinese font would be "fun", I don't see it happening soon).
[2]: http://docs.kicad-pcb.org/doxygen/v6_road_map.html#v6_sch_sys_fonts
[3]: https://en.wikipedia.org/wiki/HarfBuzz
[4]: The iPhone SMS of Death was caused by a "nonsense" Unicode code point sequence.
[5]: And if you want "real" sorting, well, that's *also* locale dependent: in German DIN 5007-1, ö=o, 5007-2, ö=oe, in Swedish, ö is at the end, after ä.

