kicad-developers team mailing list archive

Re: We should decide a quoting convention...

To: kicad-devel@xxxxxxxxxxxxxxx
From: "Lorenzo" <lomarcan@...>
Date: Tue, 22 Dec 2009 22:17:12 -0000
In-reply-to: <936b14d20912221222i186dd37cn63e5329cdecf9b86@...>
User-agent: eGroups-EW/0.82

> IMVHO such method greatly complicates parsing by any outside tool. It would
> be nice to have file format self-descriptive.
> 
> > Within a quoted string, it is assumed to be UTF8, no exceptions, and
> > > therefore inherently supports all international 16 bit characters.

> UTF-8 is able to handle all this as far as I remember, character encoding
> can be nested 4 times.

It's not 'nested'; it's simply a multibyte encoding following a specific algorithm: by design, standard ASCII (codepoint < 128) is coded as-is, and other codepoints are encoded with a progressively higher number of bytes (up to 4, as you correctly say).

In practice, ASCII is 1 byte, the upper half of Latin-1 (and some others such as Greek and Cyrillic, everything up to U+07FF) is 2 bytes, the rest of the basic plane is 3 bytes, and the 4-byte form is only for the upper planes. Anyway, the algorithm for UTF-8 is trivial, you can find it anywhere...
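
To make the "progressively higher number of bytes" concrete, here is a minimal sketch of the encoding step in Python. The function name utf8_encode is made up for illustration; real code would simply call str.encode("utf-8").

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (illustrative only)."""
    if codepoint < 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if codepoint < 0x80:
        # 1 byte: plain ASCII, stored as-is
        return bytes([codepoint])
    if codepoint < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (upper planes)
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

The letter mu (U+00B5), for example, comes out as the two bytes C2 B5.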

That said: just define that data files are UTF-8 encoded. The 'keywords' which are ASCII are, by definition, UTF-8 too :D:D

The quoting problem remains, of course... just decide on one convention. The simplest I've seen is doubling the delimiter (as in Pascal or SQL). So to put in quotes the string:

He said, "go away!"

you simply use

"He said, ""go away!"""

This is really fast AND legible AND easy to write by hand. The ability to split a string across lines is IMHO not useful in itself, BUT it can be used to express newlines as literals:

He said:
"go away!"

would become

"He said:
""go away!"""

(yes, on two lines) which is tricky to handle with a line-based reader but trivial on a stream (I mostly work with protocols on serial ports, so I usually think in frames, not lines :P).
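
A rough sketch of the doubled-delimiter convention in Python, including a character-at-a-time reader that copes with the two-line form. The names quote, unquote and read_quoted are invented for this example, and returning the one character of lookahead to the caller is just one possible design.

```python
import io


def quote(text: str) -> str:
    """Wrap text in double quotes, doubling any quote inside it."""
    return '"' + text.replace('"', '""') + '"'


def unquote(quoted: str) -> str:
    """Inverse of quote(); rejects strings with a stray single quote."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted string")
    inner = quoted[1:-1]
    # After removing doubled pairs, no quote character may be left over.
    if '"' in inner.replace('""', ''):
        raise ValueError("unescaped quote inside string")
    return inner.replace('""', '"')


def read_quoted(stream):
    """Read one quoted string from a text stream, one character at a time.

    Returns (text, lookahead): lookahead is the single character consumed
    after the closing quote ("" at end of input); it belongs to whatever
    follows the string.
    """
    if stream.read(1) != '"':
        raise ValueError("expected opening quote")
    out = []
    while True:
        ch = stream.read(1)
        if ch == "":
            raise ValueError("unterminated string")
        if ch != '"':
            out.append(ch)          # ordinary character, literal newline included
            continue
        nxt = stream.read(1)
        if nxt == '"':
            out.append('"')         # doubled quote stands for one literal quote
        else:
            return "".join(out), nxt


# Usage: the two-line example above, followed by more data.
source = io.StringIO('"He said:\n""go away!"""\nnext')
text, lookahead = read_quoted(source)
```

A line-based reader would have to notice that the first line ends inside an open string and keep reading; the stream version never needs to know where the lines are.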

The other (and IMHO better in this situation) way is to declare an escape character as suggested (typically \) that introduces the code of the character it replaces, let's say in hex:

He said:
"go away!"

becomes then

"He said:\0A\22go away!\22"

Still readable and writable by hand, easy to handle in code (check the two characters after the \ to be robust) and can represent every codepoint, even control characters: you need to escape only codepoints < 32, the quote character and the escape character itself (and maybe the 80-9F block, the C1 controls... who uses them anyway?). The resulting file would be perfectly editable with any UTF-8 editor...
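
A minimal sketch of the hex-escape variant in Python. The names escape and unescape are invented here, and escaping the backslash itself (as \5C) is an assumption of this sketch so that decoding stays unambiguous. Codepoints >= 128 are written literally, relying on the file being UTF-8.

```python
ESC = "\\"
HEX_DIGITS = "0123456789ABCDEFabcdef"


def escape(text: str) -> str:
    """Quote text, writing control characters, the quote and the
    escape character as ESC followed by two hex digits."""
    out = []
    for ch in text:
        if ord(ch) < 0x20 or ch == '"' or ch == ESC:
            out.append("%s%02X" % (ESC, ord(ch)))
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


def unescape(quoted: str) -> str:
    """Inverse of escape(); checks both hex digits after each ESC."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted string")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == ESC:
            digits = body[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("bad escape at offset %d" % i)
            out.append(chr(int(digits, 16)))
            i += 3
        elif ch == '"':
            raise ValueError("unescaped quote at offset %d" % i)
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this, escape('He said:\n"go away!"') gives the one-line form shown above, and unescape() turns it back into the original two lines.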

> Remember about ignoring UTF markers at the beginning of the file (added by
> some windows apps, not added by most linux apps) - otherwise any user
> editing the file in notepad will loose his work.

Please note: I think you're talking about the BOM (byte order mark: FF FE or FE FF in UTF-16, EF BB BF once encoded as UTF-8). Remember that unices usually work externally in UTF-8 and internally in UTF-8 or UCS-4 (char or wchar_t... wx 2.9 can be compiled for either, and one build is not compatible with the other!). I think that these days kicad uses UTF-8: since we use char, byte positions no longer line up with characters once Latin-1 characters appear (typically the letter mu), and that is probably the cause of the truncations in cvpcb. wchar_t, BTW, is more or less an unsigned long on unix :P So we have 1-4 bytes per character on disk and, depending on how we use them, 1-4 bytes OR 4 bytes in core.
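
Ignoring a leading BOM when reading a UTF-8 data file could look roughly like this in Python. The function name decode_data_file is hypothetical; codecs.BOM_UTF8 is the standard library constant for the bytes EF BB BF.

```python
import codecs


def decode_data_file(raw: bytes) -> str:
    """Decode file contents as UTF-8, dropping a leading BOM if present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


# A file saved by an editor that adds the marker, and one saved without it.
with_bom = codecs.BOM_UTF8 + "10 \u00b5F".encode("utf-8")
without_bom = "10 \u00b5F".encode("utf-8")
same_text = decode_data_file(with_bom) == decode_data_file(without_bom)
```

Python's built-in "utf-8-sig" codec does the same stripping on decode, so raw.decode("utf-8-sig") would be an equivalent one-liner.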

Of course taking the strlen of a UTF-8 string is taboo :P (it counts bytes, not characters). You can use wcslen for strings of wchar_t (std::wstring uses traits to hide the differences); the easiest way to get the length of a UTF-8 string is to convert it to UCS-4 and use wcslen... You also lose random character access with UTF-8 strings, since character size is variable.
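
The difference between byte count and character count is easy to see in a short Python sketch. The sample string is made up; the "utf-32-le" encoding stands in for UCS-4 here.

```python
text = "10 \u00b5F"                      # five characters, one of them the letter mu

utf8_bytes = text.encode("utf-8")
byte_length = len(utf8_bytes)            # what a strlen-style count sees
char_length = len(text)                  # number of codepoints

# Fixed-width route: four bytes per character, so length is bytes / 4.
ucs4_length = len(text.encode("utf-32-le")) // 4

# Cutting at a byte offset can split a character in the middle.
try:
    utf8_bytes[:4].decode("utf-8")
    cut_is_safe = True
except UnicodeDecodeError:
    cut_is_safe = False
```

Here byte_length is 6 while char_length and ucs4_length are both 5, and truncating after four bytes leaves half of the mu behind.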

Windows on the other hand uses UCS-2 (maybe now it uses UTF-16, with those braindead surrogates). So every character in memory AND on file (well, IF you handle Unicode correctly; otherwise it's simply the current codepage) is two bytes (a common complaint was that Word 2000 files were double the size of earlier ones: this is the reason); since we can apparently interchange kicad files between unix and windows, we're probably just using them in codepage mode, i.e. not using Unicode.
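
For reference, a small Python sketch of the two-byte units and of how a surrogate pair is formed for a codepoint above U+FFFF. The helper name to_surrogates is invented for the example; U+1D11E (a musical clef) is just a convenient upper-plane character.

```python
import struct


def to_surrogates(codepoint: int):
    """Split a codepoint above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need surrogates")
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_size = len("A".encode("utf-16-le"))          # one 16-bit unit
clef = "\U0001D11E"
clef_size = len(clef.encode("utf-16-le"))        # two 16-bit units
clef_units = struct.unpack("<2H", clef.encode("utf-16-le"))
```

So "two bytes per character" holds only inside the basic plane; anything above it takes four bytes in UTF-16.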

That said, I have NO IDEA how wx handles this in 2.8 (and 2.9 depends on how you compile wx itself :( ). On unix it apparently works by simply passing the bytes as characters down to the libraries... gtk interprets them as UTF-8, so if the locale is UTF-8 based it more or less works (excluding the gotchas above). wx on windows probably does the same, so it actually uses codepage values (or maybe the first page of Unicode, i.e. Latin-1).

This is only speculation and it needs testing but I suspect that kicad files using extended characters (codepoints > 127) today ARE NOT portable between unix and windows.
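
The kind of mismatch suspected here can be shown at the byte level in Python. This is only an illustration of the two encodings disagreeing, assuming a Windows codepage of cp1252; it says nothing about what kicad or wx actually write.

```python
MU = "\u00b5"                              # the letter mu, as in "10 uF" with a real mu

codepage_bytes = MU.encode("cp1252")       # one byte: B5
utf8_bytes = MU.encode("utf-8")            # two bytes: C2 B5

# A file written in codepage mode is not valid UTF-8.
try:
    codepage_bytes.decode("utf-8")
    codepage_file_reads_as_utf8 = True
except UnicodeDecodeError:
    codepage_file_reads_as_utf8 = False

# A UTF-8 file read in codepage mode shows an extra character before the mu.
utf8_file_seen_through_codepage = utf8_bytes.decode("cp1252")
```

Pure ASCII content comes through unchanged either way, which would explain why files without extended characters move between systems without trouble.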

I think I've written a bit of a lesson about encodings :P just hoping I didn't say something TOO wrong (I'm stuck programming for Windows NT4 and I really don't know what wx does under the hood). Anyway, I think these are the things to check when fixing Unicode support in kicad.

Follow ups

Re: Re: We should decide a quoting convention...
From: Manveru, 2009-12-23
Re: Re: We should decide a quoting convention...
From: Dick Hollenbeck, 2009-12-23
Re: Re: We should decide a quoting convention...
From: Alain Mouette, 2009-12-23
Re: We should decide a quoting convention...
From: vladimir_uryvaev, 2009-12-22

References

Re: Re: We should decide a quoting convention...
From: Manveru, 2009-12-22