kicad-developers team mailing list archive

Re: The problems with wxString

 

On Wed, Jan 01, 2014 at 12:58:15AM -0600, Dick Hollenbeck wrote:
> > encoding cannot be assumed to be UTF8, even though it often is on linux.  You just cannot
> > assume it.

Yep, sadly Latin-1 is not distinguishable from UTF-8, except by
heuristics...

> > In summary, I don't see any easy immediate relief from the boat anchor we know as
> > wxString, even with wx 3.0.  But I will continue to think about it.
> > 
> > Dick
> 
> 
> Attached is a patch needing a good look, that shows off a new class UTF8 that I wrote that
> solves the problems addressed above by providing conversion operators to and from
> wxString, yet holding UTF8 data in what is basically a std::string.

Yay, new year, new string class...

Yet another :((

You know, if it weren't for wx I would have used std::wstring. It's
standard, directly indexable (no MBCS) and has all the I/O support in
the world. The only problem is that the Unix 'industry standard' is to
use UTF-8 (which is an MBCS) *even in memory*, while it was designed as
an external representation for compatibility/saving bandwidth... not
that the Windows people got it wholly right: they started with Unicode
with NT 3.51 IIRC (when it was 16 bits, i.e. UCS-2) and then for
'compatibility' switched to UTF-16 (I'm not sure of the details)...
still an MBCS! The worst of both worlds: it wastes space in the typical
case AND can't do random access. In fact, none of the Unicode
Transformation Formats were meant for in-core processing. Who knows...

Instead of putting in true color icons and render-composited graphics,
couldn't they allocate more memory for processing text? (not that it
would be a problem anyway these days, with 4 or more gigs of core :P:P)

End of rant... *still* I feel the ISO/ANSI solution is the best/cleanest
(wchar_t in core, I/O encoding as UTF-8 or whatever).
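
To put numbers on the "can't do random access" point, here is a minimal
sketch (the names kUtf8, kUtf16 and kUtf32 are made up for illustration)
that stores the same two characters in three encodings. It uses char32_t
rather than wchar_t because the width of wchar_t is implementation
defined: commonly 32 bits on Linux and 16 bits on Windows, so "directly
indexable" only holds for std::wstring where wchar_t is wide enough.

```cpp
#include <string>

// 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), a code point outside the BMP.
// The UTF-8 bytes are written out by hand so the literal does not depend on
// the source or execution character set.
const std::string    kUtf8  = "a\xF0\x9D\x84\x9E";
const std::u16string kUtf16 = u"a\U0001D11E";
const std::u32string kUtf32 = U"a\U0001D11E";

// Code units needed for the same two characters:
//   kUtf8.size()  -> 1 + 4 bytes
//   kUtf16.size() -> 1 + 2 units (a surrogate pair), so kUtf16[1] is only
//                    half a character
//   kUtf32.size() -> 1 + 1 units, so kUtf32[1] is the whole code point
```

Only the fixed-width form lets operator[] return a complete code point for
every index; in the other two, index N is not character N.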

> a) how it compiles on gcc >= 4.8

I have 4.8.2 here, I'll let you know.

> c) what it does to any benchmarks of sane-ness and speed for stroke_font.h

First of all, a comment on a comment in utf8.h

/// Since the ++ operators advance more than one byte, this is your best
/// loop termination test, < end(), not == end().

The correct way to check for end() is != :D (iterators are not
necessarily ordered, either...). Also it's (theoretically) cheaper to
check for equality than to do an order comparison (I've never seen an
architecture where this holds for integers anyway... however checking
for 0 is often faster on some architectures)
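
A minimal sketch of what I mean. Utf8Iter and decodeUtf8 are hypothetical
names, NOT the UTF8 class from the patch, and the decoder assumes mostly
well-formed input. The one thing that makes != safe is that operator++
never steps past the end, even when the last sequence is truncated; an
iterator without that clamp could jump over end(), which is presumably
why the comment in utf8.h suggests <.

```cpp
#include <cstddef>
#include <string>

// Hypothetical forward-only iterator over UTF-8 bytes (illustration only).
class Utf8Iter
{
public:
    Utf8Iter( const char* aPos, const char* aEnd ) : m_pos( aPos ), m_end( aEnd ) {}

    // Decode the code point at the current position. Invalid lead bytes are
    // passed through as-is; continuation bytes are not validated.
    char32_t operator*() const
    {
        unsigned char lead = static_cast<unsigned char>( *m_pos );
        std::size_t   len = seqLen( lead );

        if( len == 1 )
            return lead;

        // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
        char32_t cp = static_cast<char32_t>( lead & ( 0xFF >> ( len + 1 ) ) );

        for( std::size_t i = 1; i < len && m_pos + i < m_end; ++i )
            cp = ( cp << 6 ) | ( static_cast<unsigned char>( m_pos[i] ) & 0x3F );

        return cp;
    }

    // Advance by one whole sequence, clamped so we land exactly on the end.
    Utf8Iter& operator++()
    {
        std::size_t len = seqLen( static_cast<unsigned char>( *m_pos ) );
        std::size_t left = static_cast<std::size_t>( m_end - m_pos );

        m_pos += ( len < left ) ? len : left;
        return *this;
    }

    bool operator==( const Utf8Iter& aOther ) const { return m_pos == aOther.m_pos; }
    bool operator!=( const Utf8Iter& aOther ) const { return m_pos != aOther.m_pos; }

private:
    // Sequence length implied by the lead byte; 1 for ASCII and for invalid leads.
    static std::size_t seqLen( unsigned char aLead )
    {
        if( ( aLead >> 5 ) == 0x06 ) return 2;
        if( ( aLead >> 4 ) == 0x0E ) return 3;
        if( ( aLead >> 3 ) == 0x1E ) return 4;
        return 1;
    }

    const char* m_pos;
    const char* m_end;
};

// The usual forward loop, terminated with != as for any other iterator.
std::u32string decodeUtf8( const std::string& aText )
{
    std::u32string out;
    const char*    first = aText.data();
    const char*    last = first + aText.size();
    const Utf8Iter end( last, last );

    for( Utf8Iter it( first, last ); it != end; ++it )
        out.push_back( *it );

    return out;
}
```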

The stroke font engine doesn't actually do a lot with the input text; in
fact it only needs a forward iterator to work. NegableTextLength scans
backwards for no serious reason (a really small optimization); it can be
trivially rewritten for forward iteration. The bulk of the code is:

- While there are chars to be plotted
    - Get the next char
    - Do horribly complicated and expensive stuff with it, including
      drawing code.

I don't think we need a benchmark to see that step 2 easily swamps the
time needed for step 1... even if instead of UTF8 we were using stateful
encodings like ISO 2022 or Shift-JIS!

So I'd say I see no issue with your proposal (it's a lesser evil).

And, by the way, the whole uni_forward() (the core of the iterator) is
just a special case of the ISO C mbtowc() function for UTF-8 :D (but
knowing the 'compatibility' of the Windows libc I guess there are
issues...)
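
For comparison, a rough sketch of the same forward decoding done with
mbtowc(). The function name decodeWithMbtowc is made up. The catch is
that mbtowc() decodes whatever encoding the current LC_CTYPE locale
says, so it only behaves like a UTF-8 decoder when a UTF-8 locale has
been selected with setlocale(), and that is not guaranteed to be
available on every platform.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Decode a multibyte string, in the encoding of the current LC_CTYPE locale,
// into wide characters. Returns false on an invalid or incomplete sequence.
bool decodeWithMbtowc( const std::string& aIn, std::wstring& aOut )
{
    aOut.clear();
    std::mbtowc( nullptr, nullptr, 0 );     // reset any shift state

    const char* p = aIn.data();
    const char* end = p + aIn.size();

    while( p != end )
    {
        wchar_t wc;
        int     n = std::mbtowc( &wc, p, static_cast<std::size_t>( end - p ) );

        if( n < 0 )
            return false;

        if( n == 0 )
            n = 1;      // embedded NUL: mbtowc() reports 0, but it occupies one byte

        aOut.push_back( wc );
        p += n;
    }

    return true;
}
```

A caller would do something like std::setlocale( LC_CTYPE, "en_US.UTF-8" )
first and check the return value; the locale name is an assumption and
differs between systems.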
-- 
Lorenzo Marcantonio
Logos Srl

