kicad-developers team mailing list archive

Re: Re: We should decide a quoting convention...

To: kicad-devel@xxxxxxxxxxxxxxx
From: Alain Mouette <alainm@...>
Date: Tue, 22 Dec 2009 23:33:45 -0200
In-reply-to: <hgrgh8+ml03@eGroups.com>
User-agent: Thunderbird 2.0.0.23 (X11/20090812)

Lorenzo escreveu:

That said: just define that data files are UTF-8 encoded. The 'keywords' which are ASCII are, by definition, UTF-8 too :D:D

You definitely have a point here, but... there could be a rule forbidding non-ASCII outside quotes ;)

He said, "go away!"

you simply use

"He said, ""go away!"""


As this is a C++ program, IMVHO it should be \", so

He said:
"go away!"

would become

"He said:\n\"go away!\""

It would be nice to have as few escapes as possible, as it helps readability. I just think that some ideas here were too complex to read, and to implement too.


Note that a rule for CR and LF inside quotes is needed too.
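
A minimal sketch of the C-style convention proposed above, assuming that only the quote, the backslash, CR and LF need escaping. The function name quoteCStyle is hypothetical and is not taken from the kicad sources:

```cpp
#include <string>

// Hypothetical helper: wrap `text` in double quotes using C-style escapes.
// Assumption: only ", \, LF and CR are escaped; everything else is copied as-is.
std::string quoteCStyle(const std::string& text)
{
    std::string out = "\"";
    for (char c : text) {
        switch (c) {
        case '"':  out += "\\\""; break;  // quote     -> \"
        case '\\': out += "\\\\"; break;  // backslash -> \\ (so escapes stay unambiguous)
        case '\n': out += "\\n";  break;  // LF        -> \n
        case '\r': out += "\\r";  break;  // CR        -> \r
        default:   out += c;      break;
        }
    }
    out += '"';
    return out;
}
```

With this sketch the two-line example above comes out as a single line, which keeps a line-based reader usable.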

for what my 2c of knowledge are worth, cheers,
Alain


This is really fast AND legible AND easy to write by hand. The facility to split a string between lines is IMHO not useful, BUT the fact can be used to express newlines as literals:

He said:
"go away!"

would become

"He said:
""go away!"""

(yes on two lines) which is tricky to handle with a line-based reader but trivial on a stream (I mostly work with protocols on serial ports, so I usually think in frames, not lines :P).
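
To illustrate why this is easy on a stream, here is a rough sketch of a reader for the doubled-quote convention. The name readDoubledQuoteField and its signature are assumptions made for the example, not an existing kicad function:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical reader: parse one quoted field where "" stands for a literal
// quote and a lone " closes the field. Newlines inside the field are kept.
// Returns false if the field does not start with " or is never closed.
bool readDoubledQuoteField(std::istream& in, std::string& out)
{
    out.clear();
    if (in.get() != '"')
        return false;                      // no opening quote

    int ch;
    while ((ch = in.get()) != std::istream::traits_type::eof()) {
        if (ch != '"') {
            out += static_cast<char>(ch);  // ordinary character, LF included
        } else if (in.peek() == '"') {
            in.get();                      // consume the second quote of the pair
            out += '"';
        } else {
            return true;                   // lone quote: end of field
        }
    }
    return false;                          // stream ended before the closing quote
}
```

Feeding it a std::istringstream holding the two-line example above should give back the original text with the newline intact.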

The other (and IMHO better in this situation) way is to declare an escape character as suggested (typically \) to encode the following one, let's say in hex:

He said:
"go away!"

becomes then

"He said:\0A\22go away!\22"

Still readable and hand writable, good to handle in code (checking the two characters after the \ to be robust) and can represent every codepoint, even control characters: you need to escape only codepoints < 32 and the quote character (and maybe the 80-9F block, the C1 controls... who uses them anyway?). The resulting file would be perfectly editable with any UTF-8 editor...
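
A sketch of the hex-escape idea, under two assumptions that go slightly beyond the text: the backslash itself is also escaped (otherwise decoding would be ambiguous), and the surrounding quotes are added by the caller. The names hexEscape and hexUnescape are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoder: bytes < 0x20, the quote and the backslash become
// a backslash followed by exactly two uppercase hex digits. UTF-8 bytes
// >= 0x80 pass through untouched.
std::string hexEscape(const std::string& text)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : text) {
        if (c < 0x20 || c == '"' || c == '\\') {
            out += '\\';
            out += digits[c >> 4];
            out += digits[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Value of one hex digit, or -1 if `c` is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical decoder: checks the two characters after each backslash,
// and returns false on a truncated or malformed escape.
bool hexUnescape(const std::string& text, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '\\') {
            out += text[i];
            continue;
        }
        if (i + 2 >= text.size())
            return false;                  // fewer than two characters left
        const int hi = hexValue(text[i + 1]);
        const int lo = hexValue(text[i + 2]);
        if (hi < 0 || lo < 0)
            return false;                  // not two hex digits
        out += static_cast<char>(hi * 16 + lo);
        i += 2;                            // skip the two digits just consumed
    }
    return true;
}
```

Encoding the two-line example gives the same body as the quoted string above, without the outer quotes.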

Remember about ignoring UTF markers at the beginning of the file (added by
some Windows apps, not added by most Linux apps) - otherwise any user
editing the file in Notepad will lose his work.

Please note: I think you're talking about the BOM (byte order mark: FF FE or FE FF in UTF-16, EF BB BF in UTF-8). Remember that unices usually work externally in UTF-8 and internally in UTF-8 or UCS-4 (char or wchar_t... wx 2.9 can be compiled for either, and one build is not compatible with the other!). I think that these days kicad uses UTF-8: we use char, the BOM becomes unaligned with Latin-1 characters (typically the mu letter), and this is probably the problem with truncations in cvpcb. wchar_t, BTW, is more or less an unsigned long :P So we have 1-4 bytes per character on disk and, depending on how we use them, 1-4 bytes OR 4 bytes in core.
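
As a small illustration of ignoring the marker on load, here is a sketch that drops a leading UTF-8 BOM (the three bytes EF BB BF) from data already read into memory. It assumes the file is UTF-8 and does not handle the UTF-16 markers; stripUtf8Bom is a hypothetical name:

```cpp
#include <string>

// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) if present.
// Returns true when a BOM was found and removed.
bool stripUtf8Bom(std::string& data)
{
    if (data.size() >= 3 &&
        static_cast<unsigned char>(data[0]) == 0xEF &&
        static_cast<unsigned char>(data[1]) == 0xBB &&
        static_cast<unsigned char>(data[2]) == 0xBF) {
        data.erase(0, 3);
        return true;
    }
    return false;
}
```

The casts to unsigned char matter because plain char may be signed, in which case the comparison with 0xEF would never match.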

Of course taking the strlen of a UTF-8 string is taboo :P You can use wcslen for strings of wchar_t (std::wstring uses traits to hide the differences); the easiest way to get the length of a UTF-8 string is to convert it to UCS-4 and use wcslen... Also, you lose random character access with UTF-8 strings since character size is variable.
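
For comparison, a different technique from the conversion described above: counting codepoints directly by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). This is only a sketch that assumes well-formed UTF-8 input, and utf8CodePointCount is a hypothetical name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of codepoints in a well-formed UTF-8 string.
// Every codepoint has exactly one byte that is NOT a continuation byte,
// so counting those bytes counts the codepoints.
std::size_t utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)  // not 10xxxxxx: starts a new codepoint
            ++count;
    }
    return count;
}
```

For the mu letter followed by F (bytes C2 B5 46), size() reports 3 while this count reports 2, which is exactly the strlen trap mentioned above.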

Windows on the other hand uses UCS-2 (maybe now it uses UTF-16 with these braindead surrogate pairs). So every character in memory AND on file (well, IF you handle Unicode correctly, otherwise it's simply the current codepage) is two bytes (it was a common complaint that Word 2000 files were double the size of earlier ones: this is the reason); since we apparently can interchange kicad files between unix and windows, we're probably just using them in codepage mode, i.e. not using Unicode.

That said I have NO IDEA about how wx handles this in 2.8 (and 2.9 depends on how you compile wx itself :( ). On unix it apparently works by simply passing the bytes as characters down to the libraries... gtk interprets them as UTF-8, so if the locale is UTF-8 based it more or less works (excluding the gotchas above). wx on windows probably does the same, so it actually uses codepage values (or maybe the first page of Unicode, i.e. Latin-1).

This is only speculation and it needs testing but I suspect that kicad files using extended characters (codepoints > 127) today ARE NOT portable between unix and windows.

I think I've written a bit of a lesson about encodings :P just hoping I didn't say something TOO wrong (I'm stuck programming for Windows NT4 and I really don't know what wx does under the hood). Anyway I think these are the things to check for in fixing Unicode support in kicad.

Follow ups

Re: We should decide a quoting convention...
From: Lorenzo, 2009-12-23

References

Re: We should decide a quoting convention...
From: Lorenzo, 2009-12-22