kicad-developers team mailing list archive

Thread
Date
Re: 6.0 string proposal

To: Dick Hollenbeck <dick@xxxxxxxxxxx>
From: Jeff Young <jeff@xxxxxxxxx>
Date: Fri, 3 May 2019 16:47:36 +0100
Cc: kicad-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <7e785e26-de37-da62-c81f-cac46e35929e@softplc.com>
I did a bit more sleuthing.  Turns out wxUSE_UNICODE_UTF8 is on for OSX.  I misread it because wxUSE_STRING_POS_CACHE is Linux-only:
#if wxUSE_UNICODE_UTF8 && !defined(__WINDOWS__) && !defined(__WXOSX__)
    #define wxUSE_STRING_POS_CACHE 1
#else
    #define wxUSE_STRING_POS_CACHE 0
#endif
… and it works just like John Beard described (ie: it’s only used for index access).

So why was I still getting the crash in IsEmpty() and Length()?  Because it’s not the position cache that’s the problem, it’s the iterators themselves.  Later in string.h we find this (note that all iterators are put in linked list in UTF8 mode, const or otherwise):
#if wxUSE_UNICODE_UTF8
// see the comment near wxString::iterator for why we need this
class WXDLLIMPEXP_BASE wxStringIteratorNode
{
public:
    wxStringIteratorNode()
        : m_str(NULL), m_citer(NULL), m_iter(NULL), m_prev(NULL), m_next(NULL) {}
    wxStringIteratorNode(const wxString *str,
                          wxStringImpl::const_iterator *citer)
        { DoSet(str, citer, NULL); }
    wxStringIteratorNode(const wxString *str, wxStringImpl::iterator *iter)
        { DoSet(str, NULL, iter); }
    ~wxStringIteratorNode()
        { clear(); }

    inline void set(const wxString *str, wxStringImpl::const_iterator *citer)
        { clear(); DoSet(str, citer, NULL); }
    inline void set(const wxString *str, wxStringImpl::iterator *iter)
        { clear(); DoSet(str, NULL, iter); }

    const wxString *m_str;
    wxStringImpl::const_iterator *m_citer;
    wxStringImpl::iterator *m_iter;
    wxStringIteratorNode *m_prev, *m_next;

private:
    inline void clear();
    inline void DoSet(const wxString *str,
                      wxStringImpl::const_iterator *citer,
                      wxStringImpl::iterator *iter);

    // the node belongs to a particular iterator instance, it's not copied
    // when a copy of the iterator is made
    wxDECLARE_NO_COPY_CLASS(wxStringIteratorNode);
};
#endif // wxUSE_UNICODE_UTF8

While this looks bad, there is some very good news here:

1) We don’t need to support extra-BMP characters.  We know this because we never have on Linux.
2) The memory constraints of using hobbled-UTF16 (UCS-2 in reality) aren’t too dire.  (They’re not killing our Linux implementation, anyway.)

So, is wxUSE_UNICODE_UTF8 really only set on OSX?  If so, were all the random crash reports really OSX-only?  (Doesn’t seem likely, but maybe.)

Or is wxUSE_UNICODE_UTF8 also set on Windows?


> On 3 May 2019, at 16:24, Dick Hollenbeck <dick@xxxxxxxxxxx> wrote:
> 
> Could that damage have already been done by prior concurrent access?  Maybe a subsequent
> read only operation fails because your linked list is already hay wired...
> 
> Walking a linked list at a later point in time than when it was damaged can do this..
> 
> 
> 
> On 5/3/19 10:18 AM, Jeff Young wrote:
>> I don’t believe that’s the case.  Neither of the two crashes that I tracked down involved
>> direct index access (or any change to the string).  One was calling foo.IsEmpty(), and the
>> other foo.Length().  Both use const iterators under the hood.
>> 
>> When wxUSE_UNICODE_UTF8 is off, wxWidgets gets around parsing the string by just not doing it:
>> 
>> Remarks
>>    Note that while the behaviour of wxString
>>    <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> when |wxUSE_UNICODE_WCHAR==1| resembles
>>    UCS-2 encoding, it's not completely correct to refer to wxString
>>    <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> as UCS-2 encoded since you can
>>    encode code points outside the /BMP/ in a wxString
>>    <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> as two code units (i.e. as a
>>    surrogate pair; as already mentioned however wxString
>>    <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> will "see" them as two
>>    different code points)
>> 
>> 
>>> On 3 May 2019, at 16:06, John Beard <john.j.beard@xxxxxxxxx <mailto:john.j.beard@xxxxxxxxx>
>>> <mailto:john.j.beard@xxxxxxxxx <mailto:john.j.beard@xxxxxxxxx>>> wrote:
>>> 
>>> Hi Jeff,
>>> 
>>> I think it is the index access operator that performs this caching, to allow you to
>>> access the n'th code point any number of times while only iterating the string only once.
>>> 
>>> However, you can still use the iterator access safely. It is only index based access
>>> that is cached and thread-unsafe.
>>> 
>>> This is what the wxString documention recommends. Furthermore, in any Unicode string,
>>> regardless of encoding (8, 16, 32), index access is almost entirely useless anyway, as
>>> code units/points are only indirectly related to glyphs and/or perceived characters
>>> anyway. If you need to parse a Unicode string, you must iterate from the start. There is
>>> no way around it.
>>> 
>>> If we're crashing due to cross thread access by index, the bug is probably that we
>>> access the string by index at all. If this was accessed by iterator, cross thread, and
>>> the string is not changed, it's fine. If the string is changed in another thread, cached
>>> iterators are invalid (same as if you change an C++ container in a single thread. The
>>> standard tells you what iterators are invalidated for each operation on a container).
>>> 
>>> I may have got the wrong end of the wxStick here (I can't check it for myself right
>>> now), but as far as I can tell, this is fixable by just never caching indices, as if we
>>> were looking at a C-style char array, and using iterators instead.
>>> 
>>> We should probably also turn off the unsafe string conversions by defining
>>> wxNO_UNSAFE_WXSTRING_CONV, if it is not already define.
>>> 
>>> Cheers,
>>> 
>>> John
>>> 
>>> On 3 May 2019 16:35:30 CEST, Jeff Young <jeff@xxxxxxxxx <mailto:jeff@xxxxxxxxx> <mailto:jeff@xxxxxxxxx <mailto:jeff@xxxxxxxxx>>> wrote:
>>> 
>>>    Yes, we know exactly why it crashes: in order to speed up iterator access each
>>>    iterator keeps a pointer into the last location accessed (so that i-1 and i+1 can be
>>>    fast).  These pointers are kept in a linked-list.  Adding and removing pointers from
>>>    this list is not thread-protected.
>>> 
>>>    Note that wxWidgets will add/remove a pointer even for something seemingly innocuous
>>>    like an Empty() check.  So doing mutex locks on our side for non-const iterator
>>>    access is not sufficient.
>>> 
>>>    The worst part is that since two threads collide on the same string only rarely, we
>>>    don’t even know how many of these bugs we have.  We’ve fixed 3 or 4 of them (by
>>>    adding our own mutex checking on any access), but are there 0 or 10 more?  Haven’t a
>>>    clue.
>>> 
>>>>>    It is between sad and breath taking.
>>> 
>>>    Indeed.
>>> 
>>>    Cheers,
>>>    Jeff.
>>> 
>>>>    On 3 May 2019, at 15:16, Dick Hollenbeck <dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>
>>>>    <mailto:dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>>> wrote:
>>>> 
>>>>    Thanks Jeff.
>>>> 
>>>>    On 5/3/19 4:22 AM, Jeff Young wrote:
>>>>>    Hi Dick,
>>>>> 
>>>>>>>    h) What is the list of deficiencies with current string usage?
>>>>> 
>>>>>    I only have one issue with the current use of wxString, but it’s a big one: it crashes
>>>>>    (unpredictably) when used multi-threaded in UTF8 mode.
>>>> 
>>>>    The fact that it is onely *One* issue is an important data point.
>>>> 
>>>>    Since you know it is crashing in this class, you must know approximately where, and
>>>>    under
>>>>    what kind of read/write activity.  Of course, if read activity triggers a lazy
>>>>    (deferred)
>>>>    transformation, then this distinction can get blurred.  But more information on source
>>>>    file locations would be very helpful to me.
>>>> 
>>>>    Another important data point you brought is that the wx library designers are advising
>>>>    against using wxString for core application.  It will take a couple of hours to even
>>>>    contemplate that, it is basically staggering to me.  It is between sad and breath
>>>>    taking.
>>>>    Sounds like they designed themselves into a corner and are now acknowledging that what
>>>>    they designed is more of an API commitment that they want to disavow than a real
>>>>    solution.
>>>> 
>>>>    I can see where that can happen.  Superior designs come from experience.
>>>>     Experience comes
>>>>    with usage and time, neither of which are always available up front.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>    This design document makes for fascinating
>>>>>    reading: https://wiki.wxwidgets.org/Development:_UTF-8_Support <https://wiki.wxwidgets.org/Development:_UTF-8_Support>.  It appears that the
>>>>>    current wxString is at least in part modelled on QtString.
>>>>> 
>>>>>    There’s also a bunch of interesting info
>>>>>    here: https://docs.wxwidgets.org/trunk/overview_string.html <https://docs.wxwidgets.org/trunk/overview_string.html>, which I believe is more
>>>>>    up-to-date than the previous link.  In particular, there’s the mention that wxString
>>>>>    handles extra-BMP characters transparently when compiled in UTF8 mode (currently
>>>>>    used by
>>>>>    Kicad), but does NOT when compiled in default mode (in which case the app must handle
>>>>>    surrogate pairs).  This of course directly leads to your point (d):
>>>>> 
>>>>>>>>    d) What does the set of characters that don't fall into UCS2 actually look
>>>>>>>>    like?  How big
>>>>>>>>    is this set, really?  (UTF16 is bigger than UCS2 and picks up the difference.)
>>>>> 
>>>>>    Do we really need to handle extra-BMP characters?
>>>>> 
>>>>>    An even more recent version of the second document
>>>>>    (https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>) finally makes an oblique
>>>>>    reference
>>>>>    to the multi-threading issue by starting with this (rather unhelpful) suggestion:
>>>>> 
>>>>>    Note
>>>>>       While the use of wxString <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> is
>>>>>       unavoidable in wxWidgets program, you are encouraged to use the standard string
>>>>>       classes |std::string| or |std::wstring| in your applications and convert them
>>>>>    to and
>>>>>       from wxString <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> only when
>>>>>       interacting with wxWidgets.
>>>>> 
>>>>> 
>>>>>    Cheers,
>>>>>    Jeff.
>>>>> 
>>>>> 
>>>>>>    On 3 May 2019, at 02:03, Dick Hollenbeck <dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>
>>>>>>    <mailto:dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>> <mailto:dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>>> wrote:
>>>>>> 
>>>>>>    On 5/2/19 5:32 PM, Dick Hollenbeck wrote:
>>>>>>>    On 4/30/19 4:36 AM, Jeff Young wrote:
>>>>>>>>    We had talked earlier about throwing the wxWidgets UTF8 compile switch to get
>>>>>>>>    rid of
>>>>>>>>    our wxString re-entrancy problems.  However, I noticed that the 6.0 work
>>>>>>>>    packages doc
>>>>>>>>    includes an item for std::string-ization of the BOARD.  (While a lot more work,
>>>>>>>>    this
>>>>>>>>    is a better solution because it also increases our gui-toolkit-choice flexibility.)
>>>>>>>> 
>>>>>>>>    I’d like to propose that we use std::wstring for that.  UTF8 should *only* be an
>>>>>>>>    encoding format (similar to s-expr).  It should never be used internally.
>>>>>>>>    That’s what
>>>>>>>>    unicode wchar_t’s are for.
>>>>>>>> 
>>>>>>>>    And I’d like to propose that we extend std::wstring-ization to SCH_ITEM and
>>>>>>>>    LIB_ITEM.
>>>>>>>>     (Then we can get rid of a bunch of our ugly mutex hacks.)
>>>>>>> 
>>>>>>> 
>>>>>>>    I've been looking at this for a few months now.  I think it is so important, that a
>>>>>>>    sub-committee should be formed, and if that committee takes as long as 4 months
>>>>>>>    to come to
>>>>>>>    a recommendation, this would not be too long.  This issue is simply too critical.
>>>>>>> 
>>>>>>>    I would like to volunteer to be on that committee.  For the entire list to
>>>>>>>    participate in
>>>>>>>    this simply does not make sense to me.  I would welcome the opportunity to study
>>>>>>>    this with
>>>>>>>    a team of 5-6 players.  More than that probably leads to anxiety.  Then, given the
>>>>>>>    recommendations, the list would of course have an opportunity to raise questions
>>>>>>>    and take
>>>>>>>    shots, before a strategy is formulated, and before anything is implemented.
>>>>>>> 
>>>>>>>    Again, approximately:
>>>>>>> 
>>>>>>>     committee recommendations -> list approval -> strategy formulation ->
>>>>>>>    implementation
>>>>>>> 
>>>>>>> 
>>>>>>>    Up to now I have looked at many libraries and have [way *too* much] experience
>>>>>>>    in multiple
>>>>>>>    languages on multiple platforms, so I think I can be valuable contributor.
>>>>>>> 
>>>>>>>    The final work product initially would simply be a list of recommendations, that
>>>>>>>    quickly
>>>>>>>    transforms to a strategy thereafter.  This is an enormous undertaking, so I suggest
>>>>>>>    against racing to a solution.  It could look a lot easier than it will
>>>>>>>    ultimately be, as
>>>>>>>    is typical in software development.  But the return on investment needs to be
>>>>>>>    near optimal
>>>>>>>    in the end.
>>>>>>> 
>>>>>>>    Some questions to answer are:
>>>>>>> 
>>>>>>>    a) How did wxString get to its current state?  Is is merely a conglomeration of
>>>>>>>    after
>>>>>>>    thought, or is is anywhere near optimal.
>>>>>>> 
>>>>>>>    b) Why so many forms of it?  Can one form be chosen for all platforms?
>>>>>>> 
>>>>>>>    c) How does wxString it compare to QtString?
>>>>>>> 
>>>>>>>    d) What does the set of characters that don't fall into UCS2 actually look like?
>>>>>>>     How big
>>>>>>>    is this set, really?  (UTF16 is bigger than UCS2 and picks up the difference.)
>>>>>>> 
>>>>>>>    e) For data files, I think UTF8 is fine.  So the change is for RAM manipulation of
>>>>>>>    strings.  Aren't we talking about a RAM resident string that bridges into the GUI
>>>>>>>    seamlessly?
>>>>>>> 
>>>>>>>    f) What does new C++ language support offer?
>>>>>>> 
>>>>>>>    g) What do C++ language designers suggest?
>>>>>> 
>>>>>>    h) What is the list of deficiencies with current string usage?
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>    etc.
>>>>>>> 
>>>>>>>    But this is best continued in a smaller group, as said.
>>>>>>> 
>>>>>>> 
>>>>>>>    The other thing that I bring to this is vast familiarity with KiCad's internal
>>>>>>>    workings,
>>>>>>>    string use cases, and goals.
>>>>>>> 
>>>>>>>    Let me know if I can help.
>>>>>>> 
>>>>>>>    Regards,
>>>>>>> 
>>>>>>>    Dick
>>>>>>> 
>>>>>>> 
>>>>>>>    _______________________________________________
>>>>>>>    Mailing list: https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>>>>    Post to     : kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>
>>>>>>>    <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>>>>>    <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>>>>>    Unsubscribe : https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>>>>    More help   : https://help.launchpad.net/ListHelp <https://help.launchpad.net/ListHelp>
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>    _______________________________________________
>>>>>>    Mailing list: https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>>>    Post to     : kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>
>>>>>>    <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>>>>    <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>>>>    Unsubscribe : https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>>>    More help   : https://help.launchpad.net/ListHelp <https://help.launchpad.net/ListHelp>
Follow ups

Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
References

6.0 string proposal
From: Jeff Young, 2019-04-30
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-02
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03
Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03
Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
Re: 6.0 string proposal
From: John Beard, 2019-05-03
Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03