
kicad-developers team mailing list archive

Re: 6.0 string proposal

 

Thanks for the analysis, John.  However, those numbers are with wxWidgets’ performance optimizations (the ones that crash when multi-threaded) enabled, so we don’t really know how bad they would be without them.

Also, wxString hides the UTF8 serialization from us and makes for simpler code.  You can use myString[3] and get the character at index 3.  You can’t do that with UTF8 strings.

You are correct that you also can’t do it with UTF32 strings, but I’m not suggesting those.  I’m suggesting *unicode* strings.  That’s 1 code-point per character.  So myString[3] still works. 

(I don’t think graphemes, ligatures, extended-code-points, etc. are a real problem for us.  Heck, our stroke font doesn’t even support 10% of unicode.)
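To make the indexing point concrete, here is a minimal sketch using only the standard library (std::string holding UTF-8 bytes, std::u32string holding one code point per element). The function names are hypothetical and only for illustration; nothing here is KiCad or wxWidgets code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "naïve" as UTF-8: the ï (U+00EF) takes two bytes, 0xC3 0xAF.
inline std::string sample_utf8() { return "na\xC3\xAFve"; }

// The same text with one 32-bit element per code point.
inline std::u32string sample_utf32() { return U"na\u00EFve"; }

// operator[] on a UTF-8 std::string returns a byte (code unit).
inline unsigned char byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// "é" as a single code point (U+00E9) ...
inline std::u32string composed_e() { return U"\u00E9"; }
// ... and as 'e' followed by the combining acute accent U+0301.
inline std::u32string decomposed_e() { return U"e\u0301"; }

inline void indexing_demo() {
    assert(sample_utf8().size() == 6);          // six bytes
    assert(sample_utf32().size() == 5);         // five code points
    assert(byte_at(sample_utf8(), 3) == 0xAF);  // second byte of ï, not 'v'
    assert(sample_utf32()[3] == U'v');          // code point at index 3
    assert(composed_e().size() == 1);           // one code point
    assert(decomposed_e().size() == 2);         // two code points, one visible character
}
```

With one code point per element, index 3 lands on 'v' as expected; with UTF-8 bytes it lands in the middle of ï. The last two checks show the case where even code-point indexing does not match what a reader sees as one character.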

Cheers,
Jeff.

Note: if there’s a std::string library that hides the UTF8 serialization from us I could be talked into using that.  I do agree that it looks like the performance wouldn’t be a deal-breaker.
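As a rough idea of what "hiding the serialization" over a plain std::string would involve, the sketch below looks up the n-th code point by scanning from the start. The helper name utf8_code_point_at is hypothetical, not from any existing library, and it assumes well-formed UTF-8 (continuation bytes are not validated).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the n-th code point (0-based) of a UTF-8 string.
// The lookup is O(n) because each sequence length is only known from its lead byte.
inline char32_t utf8_code_point_at(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)               { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead >> 5) == 0x06)  { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E)  { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E)  { len = 4; cp = lead & 0x07; }  // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > s.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        if (n == 0) {
            // Fold in the low six bits of each continuation byte.
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
            return cp;
        }
        --n;
        i += len;  // skip to the next lead byte
    }
    throw std::out_of_range("code point index out of range");
}
```

A wrapper class could expose this through operator[], at the cost of a linear scan per lookup instead of the constant-time access a fixed-width string gives.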


> On 30 Apr 2019, at 17:22, John Beard <john.j.beard@xxxxxxxxx> wrote:
> 
> On 30/04/2019 16:01, Jeff Young wrote:
>> Primarily for performance reasons.
> 
> WRT performance, I did a few benchmarks for reference (on Linux)
> 
> Loading this large CIAA PCB[1] allocates, out of a peak usage of 467MB of heap with a 0.01% threshold:
> 
> * 9.6MB of std::basic_string<wchar_t>::_M_assign
>   * 9.4MB of this is from wxString operator= assignments
> * ~600kB of std::basic_string<wchar_t>::_M_construct, (wxString ctor)
> 
> So I'm not sure memory usage is a major factor to worry about (strings allocate storage on the heap, so we should see basically all the interesting things in the heap profile). UTF-8 could be as little as 1/4 the size of UTF-32 (if all strings are ASCII), but even then, it's only a few MB saved.
> 
> Now, in terms of performance, opening Pcbnew with no file gives:
> 
> #4      3.36%   __gconv_transform_utf8_internal
> #5      2.51%   __mbsrtowcs_l
> #6      2.50%   wxMBConv::ToWChar
> #8      2.07%   std::basic_string<wchar_t>::_M_assign
> #9      1.88%   wxMBConvStrictUTF8::ToWChar
> #14     1.27%   EscapeString (kicad function)
> #17     0.85%   __GI___strlen_sse2
> #18     0.85%   wxUniChar::From8bit
> #19     0.84%   wxUniChar::operator==
> 
> And plenty more string-y things in the top 50 or so lines. So it seems the biggest cost for strings is converting them from UTF-8 to wchar_t strings in WX (this is probably not the same on Windows). But it's not really a stunning cost.
> 
> However, when loading the CIAA board, there are basically no string operations above 0.5%, and only a handful even above 0.25%. When doing DRC, strings don't break 0.1%: nearly all the significant work is looking things up in std::maps and geometry.
> 
> So string performance doesn't seem to be *that* critical, as it's quickly drowned out under real workloads. It looks to me (and I'm happy to be corrected, I'm not a perf expert), like string operations in KiCad are not much of a bottleneck.
> 
> > Because characters are different lengths, you have to scan the string
> > to find the n’th character.
> 
> Even with UTF-32, you can only do an O(1) lookup of the n'th *code point* or *code unit* (the same in UTF-32, not in UTF-8), not the n'th *encoded character*.
> 
> That's true even if you normalise the strings first. Not all code points map one-to-one to an encoded character (it can be one-to-none, one-to-one, many-to-one). And that's even without considering grapheme clustering.
> 
> Cheers,
> 
> John
> 
> PS / OT: If we had to optimise one thing, PolygonTriangulation::Vertex::inTriangle is the single hungriest function, chewing 6.19% of all CPU time, double that of each of the next 3: __gnu_cxx::__exchange_and_add (2.76%),  PolygonTriangulation::isEar (2.73%) and even malloc (2.27%).
> 
> Other than that fairly mundane 6%-er, there are no eye-popping performance hogs simply on loading a PCB. Which is nice.
> 
> [1]: https://github.com/ciaa/Hardware/blob/master/PCB/ACC/CIAA_ACC/ciaa_acc.kicad_pcb

