
kicad-developers team mailing list archive

Re: Re: We should decide a quoting convention...

 

vladimir_uryvaev wrote:
--- In kicad-devel@xxxxxxxxxxxxxxx, Manveru <manveru@...> wrote:

For example, a "`" (grave accent) character followed by a character with
code n should be treated as the character with code (n^0x40). So symbols get
these escape codes: '\n' -> '`J', '"' -> '`b', '#' -> '`c', '`' -> '` ' etc.
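
A minimal sketch of the grave-accent scheme described above, assuming 0x40 as the XOR constant because that is the value that reproduces all four listed mappings. The function names and the set of characters chosen for escaping are hypothetical, picked only for illustration; nothing here is taken from KiCad code.

```python
ESCAPE = "`"
# Assumption: only these characters are escaped in this sketch.
NEEDS_ESCAPE = {"\n", '"', "#", "`"}


def grave_encode(text):
    """Replace each special character by '`' plus (code XOR 0x40)."""
    out = []
    for ch in text:
        if ch in NEEDS_ESCAPE:
            out.append(ESCAPE + chr(ord(ch) ^ 0x40))
        else:
            out.append(ch)
    return "".join(out)


def grave_decode(text):
    """Undo grave_encode: the character after '`' is XORed with 0x40."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == ESCAPE:
            if i + 1 >= len(text):
                raise ValueError("dangling escape at end of input")
            out.append(chr(ord(text[i + 1]) ^ 0x40))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Every escape costs exactly two characters, which is the space argument made for this scheme later in the thread.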

Is it reinventing the wheel? The most common approach these days in open
source projects with text formats is to support UTF-8 plus the classic C
escape sequences, with \x or \dddd (where x is a letter and dddd is a code).
Strings are then enclosed in " (double quotes) and a multiline string is
split with \↵ (backslash+cr).


It is inventing a better wheel: \x and \nnn codes are just a waste of space, as each escape takes twice as many characters. =) But I have nothing more against this. I've just shown the principle. But why should we use C-style? There are many fine escape coding standards, for example URL encoding (%xx).

Also, do we really need multiline? The schematic/PCB format is not primarily a human-readable format. Multiline will only complicate parsers. If a human needs to read a schematic file, (s)he can turn on line wrapping in the text editor.
I also think that quotes (") are redundant in the file format. If spaces and linefeeds are escape coded (%20 and %0A), the parser can simply stop reading a text string at a space or linefeed.
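
A rough sketch of the quote-free idea above, assuming URL-style percent encoding via Python's standard `urllib.parse`. The helper names `encode_token` and `parse_line` are hypothetical and only illustrate how a reader could split on whitespace once spaces and linefeeds are encoded.

```python
from urllib.parse import quote, unquote


def encode_token(text):
    """Percent-encode everything except unreserved characters.

    With safe="" a space becomes %20 and a linefeed becomes %0A,
    so the encoded token never contains whitespace.
    """
    return quote(text, safe="")


def parse_line(line):
    """Split on whitespace, then decode each token."""
    return [unquote(tok) for tok in line.split()]
```

One limitation of this sketch: an empty string encodes to nothing at all, so it cannot be represented as a token without some extra convention.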

Keep it simple!


We will keep it simple, and I admit that there are a couple minor holes in the lisp-like format that we need to plug.

In general however, my thinking is this:

Any such file is to be interpreted as a blend of ASCII sequences with intermittent UTF8 sequences. The ASCII sequences are the keywords and the '(' and ')' delimiters; everything except a quoted string is ASCII.

The UTF8 sequences are reserved ONLY for quoted strings.

Quoted strings are required ONLY for tokens which must include either a) one of the ASCII white space characters, or b) a non ASCII character, or c) ')' or '('.
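
The three conditions above can be sketched as a small predicate that a file writer might use to decide whether to emit quotes. The name `needs_quotes` and the exact whitespace set are assumptions made for this illustration, not part of any existing lexer.

```python
# Assumption: "ASCII white space" means these six characters.
ASCII_WHITESPACE = " \t\n\r\f\v"


def needs_quotes(token):
    """True if the token contains whitespace, a non-ASCII character, or a paren."""
    return any(
        ch in ASCII_WHITESPACE or ch in "()" or ord(ch) > 0x7F
        for ch in token
    )
```

For instance, `needs_quotes("R1")` is false, while `needs_quotes("my net")` and `needs_quotes("a(b")` are both true.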

The content of a quoted string is assumed to be UTF8, no exceptions, and therefore inherently supports all international characters.

With this understanding the problem is reduced to quoted strings, and

A) differentiating the leading and trailing quote from a quote character within the quoted string, and

B) as aid for human readability, some consideration might be given to the handling of new lines, so that they do not screw up the pretty indenting that these files typically have.


For example:

(multiline_text "ABC" "DEF" )

The parser can recombine the ABC and DEF into "ABC\nDEF" when it sees T_multiline_text.
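
A toy illustration of that recombination step. The regular-expression lexer and the function `parse_multiline_text` are hypothetical stand-ins, not the real DSNLEXER, and the sketch assumes quoted strings contain no embedded quote characters, so it sidesteps issue A) entirely.

```python
import re

# Tokens: '(' | ')' | "quoted string without embedded quotes" | bare symbol
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def lex(text):
    """Return a list of (kind, value) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        if tok in ("(", ")"):
            tokens.append((tok, tok))
        elif tok.startswith('"'):
            tokens.append(("string", tok[1:-1]))
        else:
            tokens.append(("symbol", tok))
    return tokens


def parse_multiline_text(text):
    """Join the quoted strings of a (multiline_text ...) form with newlines."""
    tokens = lex(text)
    if len(tokens) < 3 or tokens[0][0] != "(" or tokens[-1][0] != ")":
        raise ValueError("expected a parenthesized form")
    if tokens[1] != ("symbol", "multiline_text"):
        raise ValueError("expected the multiline_text keyword")
    lines = []
    for kind, value in tokens[2:-1]:
        if kind != "string":
            raise ValueError("expected only quoted strings")
        lines.append(value)
    return "\n".join(lines)
```

The joining happens after tokenizing, which matches the point that this is a grammar-level decision rather than a lexer-level one.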


B) is up to the grammar designer, as it happens at the parser level, not at the lexer level. Only A) is a DSNLEXER issue.

Another designer might allow

"ABC
DEF"

And decide human readability of this file is not so important.

Dick







