kicad-developers team mailing list archive

Thread
Date
Re: 6.0 string proposal

To: John Beard <john.j.beard@xxxxxxxxx>
From: Jeff Young <jeff@xxxxxxxxx>
Date: Fri, 3 May 2019 16:18:46 +0100
Cc: kicad-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <0F9F1501-2952-497B-B234-A606DD2772C5@gmail.com>
I don’t believe that’s the case.  Neither of the two crashes that I tracked down involved direct index access (or any change to the string).  One was calling foo.IsEmpty(), and the other foo.Length().  Both use const iterators under the hood.

When wxUSE_UNICODE_UTF8 is off, wxWidgets gets around parsing the string by just not doing it:

Remarks
Note that while the behaviour of wxString <https://docs.wxwidgets.org/trunk/classwx_string.html> when wxUSE_UNICODE_WCHAR==1 resembles UCS-2 encoding, it's not completely correct to refer to wxString <https://docs.wxwidgets.org/trunk/classwx_string.html> as UCS-2 encoded since you can encode code points outside the BMP in a wxString <https://docs.wxwidgets.org/trunk/classwx_string.html> as two code units (i.e. as a surrogate pair; as already mentioned however wxString <https://docs.wxwidgets.org/trunk/classwx_string.html> will "see" them as two different code points)


> On 3 May 2019, at 16:06, John Beard <john.j.beard@xxxxxxxxx> wrote:
> 
> Hi Jeff,
> 
> I think it is the index access operator that performs this caching, to allow you to access the n'th code point any number of times while only iterating the string only once.
> 
> However, you can still use the iterator access safely. It is only index based access that is cached and thread-unsafe.
> 
> This is what the wxString documention recommends. Furthermore, in any Unicode string, regardless of encoding (8, 16, 32), index access is almost entirely useless anyway, as code units/points are only indirectly related to glyphs and/or perceived characters anyway. If you need to parse a Unicode string, you must iterate from the start. There is no way around it.
> 
> If we're crashing due to cross thread access by index, the bug is probably that we access the string by index at all. If this was accessed by iterator, cross thread, and the string is not changed, it's fine. If the string is changed in another thread, cached iterators are invalid (same as if you change an C++ container in a single thread. The standard tells you what iterators are invalidated for each operation on a container).
> 
> I may have got the wrong end of the wxStick here (I can't check it for myself right now), but as far as I can tell, this is fixable by just never caching indices, as if we were looking at a C-style char array, and using iterators instead.
> 
> We should probably also turn off the unsafe string conversions by defining wxNO_UNSAFE_WXSTRING_CONV, if it is not already define.
> 
> Cheers,
> 
> John
> 
> On 3 May 2019 16:35:30 CEST, Jeff Young <jeff@xxxxxxxxx> wrote:
> Yes, we know exactly why it crashes: in order to speed up iterator access each iterator keeps a pointer into the last location accessed (so that i-1 and i+1 can be fast).  These pointers are kept in a linked-list.  Adding and removing pointers from this list is not thread-protected.
> 
> Note that wxWidgets will add/remove a pointer even for something seemingly innocuous like an Empty() check.  So doing mutex locks on our side for non-const iterator access is not sufficient.
> 
> The worst part is that since two threads collide on the same string only rarely, we don’t even know how many of these bugs we have.  We’ve fixed 3 or 4 of them (by adding our own mutex checking on any access), but are there 0 or 10 more?  Haven’t a clue.
> 
>>> It is between sad and breath taking.
> 
> Indeed.
> 
> Cheers,
> Jeff.
> 
>> On 3 May 2019, at 15:16, Dick Hollenbeck <dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>> wrote:
>> 
>> Thanks Jeff.
>> 
>> On 5/3/19 4:22 AM, Jeff Young wrote:
>>> Hi Dick,
>>> 
>>>>> h) What is the list of deficiencies with current string usage?
>>> 
>>> I only have one issue with the current use of wxString, but it’s a big one: it crashes
>>> (unpredictably) when used multi-threaded in UTF8 mode.
>> 
>> The fact that it is onely *One* issue is an important data point.
>> 
>> Since you know it is crashing in this class, you must know approximately where, and under
>> what kind of read/write activity.  Of course, if read activity triggers a lazy (deferred)
>> transformation, then this distinction can get blurred.  But more information on source
>> file locations would be very helpful to me.
>> 
>> Another important data point you brought is that the wx library designers are advising
>> against using wxString for core application.  It will take a couple of hours to even
>> contemplate that, it is basically staggering to me.  It is between sad and breath taking.
>> Sounds like they designed themselves into a corner and are now acknowledging that what
>> they designed is more of an API commitment that they want to disavow than a real solution.
>> 
>> I can see where that can happen.  Superior designs come from experience.  Experience comes
>> with usage and time, neither of which are always available up front.
>> 
>> 
>> 
>> 
>> 
>>> 
>>> This design document makes for fascinating
>>> reading: https://wiki.wxwidgets.org/Development:_UTF-8_Support <https://wiki.wxwidgets.org/Development:_UTF-8_Support>.  It appears that the
>>> current wxString is at least in part modelled on QtString.
>>> 
>>> There’s also a bunch of interesting info
>>> here: https://docs.wxwidgets.org/trunk/overview_string.html <https://docs.wxwidgets.org/trunk/overview_string.html>, which I believe is more
>>> up-to-date than the previous link.  In particular, there’s the mention that wxString
>>> handles extra-BMP characters transparently when compiled in UTF8 mode (currently used by
>>> Kicad), but does NOT when compiled in default mode (in which case the app must handle
>>> surrogate pairs).  This of course directly leads to your point (d):
>>> 
>>>>>> d) What does the set of characters that don't fall into UCS2 actually look like?  How big
>>>>>> is this set, really?  (UTF16 is bigger than UCS2 and picks up the difference.)
>>> 
>>> Do we really need to handle extra-BMP characters?
>>> 
>>> An even more recent version of the second document
>>> (https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>) finally makes an oblique reference
>>> to the multi-threading issue by starting with this (rather unhelpful) suggestion:
>>> 
>>> Note
>>>    While the use of wxString <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> is
>>>    unavoidable in wxWidgets program, you are encouraged to use the standard string
>>>    classes |std::string| or |std::wstring| in your applications and convert them to and
>>>    from wxString <https://docs.wxwidgets.org/trunk/classwx_string.html <https://docs.wxwidgets.org/trunk/classwx_string.html>> only when
>>>    interacting with wxWidgets.
>>> 
>>> 
>>> Cheers,
>>> Jeff.
>>> 
>>> 
>>>> On 3 May 2019, at 02:03, Dick Hollenbeck <dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx> <mailto:dick@xxxxxxxxxxx <mailto:dick@xxxxxxxxxxx>>> wrote:
>>>> 
>>>> On 5/2/19 5:32 PM, Dick Hollenbeck wrote:
>>>>> On 4/30/19 4:36 AM, Jeff Young wrote:
>>>>>> We had talked earlier about throwing the wxWidgets UTF8 compile switch to get rid of
>>>>>> our wxString re-entrancy problems.  However, I noticed that the 6.0 work packages doc
>>>>>> includes an item for std::string-ization of the BOARD.  (While a lot more work, this
>>>>>> is a better solution because it also increases our gui-toolkit-choice flexibility.)
>>>>>> 
>>>>>> I’d like to propose that we use std::wstring for that.  UTF8 should *only* be an
>>>>>> encoding format (similar to s-expr).  It should never be used internally. That’s what
>>>>>> unicode wchar_t’s are for.
>>>>>> 
>>>>>> And I’d like to propose that we extend std::wstring-ization to SCH_ITEM and LIB_ITEM.
>>>>>>  (Then we can get rid of a bunch of our ugly mutex hacks.)
>>>>> 
>>>>> 
>>>>> I've been looking at this for a few months now.  I think it is so important, that a
>>>>> sub-committee should be formed, and if that committee takes as long as 4 months to come to
>>>>> a recommendation, this would not be too long.  This issue is simply too critical.
>>>>> 
>>>>> I would like to volunteer to be on that committee.  For the entire list to participate in
>>>>> this simply does not make sense to me.  I would welcome the opportunity to study this with
>>>>> a team of 5-6 players.  More than that probably leads to anxiety.  Then, given the
>>>>> recommendations, the list would of course have an opportunity to raise questions and take
>>>>> shots, before a strategy is formulated, and before anything is implemented.
>>>>> 
>>>>> Again, approximately:
>>>>> 
>>>>>  committee recommendations -> list approval -> strategy formulation -> implementation
>>>>> 
>>>>> 
>>>>> Up to now I have looked at many libraries and have [way *too* much] experience in multiple
>>>>> languages on multiple platforms, so I think I can be valuable contributor.
>>>>> 
>>>>> The final work product initially would simply be a list of recommendations, that quickly
>>>>> transforms to a strategy thereafter.  This is an enormous undertaking, so I suggest
>>>>> against racing to a solution.  It could look a lot easier than it will ultimately be, as
>>>>> is typical in software development.  But the return on investment needs to be near optimal
>>>>> in the end.
>>>>> 
>>>>> Some questions to answer are:
>>>>> 
>>>>> a) How did wxString get to its current state?  Is is merely a conglomeration of after
>>>>> thought, or is is anywhere near optimal.
>>>>> 
>>>>> b) Why so many forms of it?  Can one form be chosen for all platforms?
>>>>> 
>>>>> c) How does wxString it compare to QtString?
>>>>> 
>>>>> d) What does the set of characters that don't fall into UCS2 actually look like?  How big
>>>>> is this set, really?  (UTF16 is bigger than UCS2 and picks up the difference.)
>>>>> 
>>>>> e) For data files, I think UTF8 is fine.  So the change is for RAM manipulation of
>>>>> strings.  Aren't we talking about a RAM resident string that bridges into the GUI
>>>>> seamlessly?
>>>>> 
>>>>> f) What does new C++ language support offer?
>>>>> 
>>>>> g) What do C++ language designers suggest?
>>>> 
>>>> h) What is the list of deficiencies with current string usage?
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> etc.
>>>>> 
>>>>> But this is best continued in a smaller group, as said.
>>>>> 
>>>>> 
>>>>> The other thing that I bring to this is vast familiarity with KiCad's internal workings,
>>>>> string use cases, and goals.
>>>>> 
>>>>> Let me know if I can help.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Dick
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Mailing list: https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>> Post to     : kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>
>>>>> <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>>> Unsubscribe : https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>>> More help   : https://help.launchpad.net/ListHelp <https://help.launchpad.net/ListHelp>
>>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Mailing list: https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>> Post to     : kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>
>>>> <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx <mailto:kicad-developers@xxxxxxxxxxxxxxxxxxx>>
>>>> Unsubscribe : https://launchpad.net/~kicad-developers <https://launchpad.net/~kicad-developers>
>>>> More help   : https://help.launchpad.net/ListHelp <https://help.launchpad.net/ListHelp>
Follow ups

Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03
References

6.0 string proposal
From: Jeff Young, 2019-04-30
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-02
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03
Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
Re: 6.0 string proposal
From: Dick Hollenbeck, 2019-05-03
Re: 6.0 string proposal
From: Jeff Young, 2019-05-03
Re: 6.0 string proposal
From: John Beard, 2019-05-03