kicad-developers team mailing list archive
Re: We should decide a quoting convention...
Wed, 23 Dec 2009 07:59:12 -0000
--- In kicad-devel@xxxxxxxxxxxxxxx, Dick Hollenbeck <dick@...> wrote:
> Re: DSNLEXER
> OK, I spent the last several hours improving DSNLEXER.
> My last posting stands. Keywords must be ASCII, not UTF8. These are
> tokens which are known ahead of time, so there's no reason to stray from
> ASCII, and it is silly trying to sort UTF8 strings. Keywords are sorted
> ahead of time. UTF8 must be quoted and is where you would put app
> specific identifiers.
No problem with that. ASCII technically IS UTF8, so you can pick it all up with the same lexer without problems. Just don't define a keyword with code points over 127, otherwise all the other ones become invalid :D
As a side note, UTF8 can be sorted (NOT collated, that's WAY more difficult) without problems, just like ASCII, if you want to do a binary search (I assume that's your problem). Or just hash them, that's faster too :P (gperf comes to mind)
The fact is you don't have to switch encoding within the file. That's bad, since ALL editors assume only ONE encoding per file. How you handle it in memory depends on what kind of character type you use:
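To make the binary search idea concrete, here is a minimal sketch, not the actual DSNLEXER code. The keyword table and the findKeyword() name are made up for illustration; the only real requirement is that the table is kept sorted in strcmp() order.

```cpp
#include <algorithm>
#include <cstring>
#include <iterator>

// Hypothetical keyword table (NOT the real DSNLEXER list).
// It must stay sorted in strcmp() order for the binary search to work.
static const char* const keywords[] = { "layer", "net", "pcb", "unit", "via" };

// Returns the index of 'token' in the table, or -1 if it is not a keyword.
int findKeyword(const char* token)
{
    const char* const* first = std::begin(keywords);
    const char* const* last  = std::end(keywords);

    // Binary search: first entry that does not compare less than 'token'.
    const char* const* it = std::lower_bound(first, last, token,
        [](const char* entry, const char* value)
        {
            return std::strcmp(entry, value) < 0;
        });

    if (it != last && std::strcmp(*it, token) == 0)
        return static_cast<int>(it - first);

    return -1;
}
```

A token containing UTF8 bytes simply compares greater than every ASCII keyword and falls out as "not a keyword", so the lookup needs no special casing.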
- Using char you'll have "longer" strings if an extended code point comes up. No trouble with this, a trivial strcmp sorts them correctly.
- Using wchar_t you'll have "bigger" characters, so there will always be one wchar_t for each code point (at least where wchar_t is 32 bits wide; with a 16-bit wchar_t, as on Windows, code points above U+FFFF take two). Even easier! Of course strcmp would fail here, wcscmp is the function for the job.
All that is for ISO C90; in C++ it's even easier... the first case is std::string, the second one is std::wstring. Just use operator < and the traits do their work behind your back.
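A small sketch of the two in-memory cases, under some assumptions: the function names are made up, and std::u32string stands in for std::wstring so that one element per code point holds on every platform (wchar_t is only 16 bits on some of them). In both cases plain operator < gives the ordering.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Case 1: UTF8 held in std::string. operator< compares the raw bytes
// as unsigned values, which for UTF8 matches code point order.
std::vector<std::string> sortUtf8(std::vector<std::string> v)
{
    std::sort(v.begin(), v.end());
    return v;
}

// Case 2: one element per code point. std::u32string is used here as an
// assumption in place of std::wstring, see the note above.
std::vector<std::u32string> sortUtf32(std::vector<std::u32string> v)
{
    std::sort(v.begin(), v.end());
    return v;
}
```

For example, sorting "z", "é" (bytes C3 A9 in UTF8, U+00E9 as a code point) and "a" puts "é" last in both cases. That is sorting, not collation: a dictionary would want "é" next to "e".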
> So I would like to shift the discussion to grammar, and away from this
> syntax churn / noise. When I say grammar, I am speaking of a sequence
> of tokens & keywords which are returned from the DSNLEXER. Think higher
> level now.
That's another problem... I don't know the SPECCTRA requirements, so I assumed we could go with the simpler encoding... well, if it requires multiline strings we must put them in, following THEIR rules...
Why not just pick another SPECCTRA-capable program and feed it an extended character? Then just look at how it spits it out and go that way...
Or maybe SPECCTRA simply doesn't support extended character sets (IIRC EDIF is strictly ASCII, if you want some extended Latin you simply... can't :D). GNUCAP/SPICE decks are similar, they're strictly ASCII too (well, GNUCAP even less, net names must be strictly alphanumeric... the / we use for sheet paths is rejected).