linuxdcpp-team team mailing list archive

[Bug 1649066] Re: Invalid UTF-8 data is not always being rejected

To: linuxdcpp-team@xxxxxxxxxxxxxxxxxxx
From: eMTee <1649066@xxxxxxxxxxxxxxxxxx>
Date: Tue, 25 Jun 2024 14:56:17 -0000
Reply-to: Bug 1649066 <1649066@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

I ran some performance tests on the changes made by this patch
(using the optimized release build).

After a few startups to get the file fully cached by Windows, I loaded
a hashindex.xml with 80k lines 10 times each with and without UTF-8
validation. Enabling validation results in an average overhead of 3.1%.
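
For reference, a comparison like this can be timed with a small harness
along the following lines. This is only a minimal sketch, not the code
used for the numbers above; averageMs and overheadPercent are
hypothetical helper names, and the workload passed in is assumed to be
the hashindex.xml load.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock time in milliseconds of
// `runs` calls to `fn`. `runs` is assumed to be greater than zero.
template <typename Fn>
double averageMs(Fn&& fn, std::size_t runs) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        Clock::now() - start;
    return elapsed.count() / static_cast<double>(runs);
}

// Hypothetical helper: relative overhead of `withMs` over `baselineMs`,
// in percent.
inline double overheadPercent(double baselineMs, double withMs) {
    return (withMs - baselineMs) / baselineMs * 100.0;
}
```

Calling averageMs twice with 10 runs each, once with validation off and
once with it on, and passing both results to overheadPercent gives the
kind of figure quoted above.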

While I was at it, I also tested the switch of Text::wideToUtf8 and
utf8ToWide to the WinAPI versions, added in
https://bugs.launchpad.net/dcplusplus/+bug/1473791 presumably to get
better Unicode version support for the new function in the patch and
throughout the program in general.

It looks like the Win32 versions and the old ones (now used on Unix
only) perform pretty much the same in a wideToUtf8(utf8ToWide(str))
operation done on all the data used in the test scenario above.
Surprisingly, the Win32 version of wideToUtf8 appears to be even a few
percent faster than the one used previously. So adding the WinAPI
versions seems justified, especially since they also appear to support
UTF-16 surrogate pairs, so they should work better for Far East locales.
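
To illustrate what surrogate pair support means for a wide-to-UTF-8
conversion, here is a minimal portable sketch. It is neither the DC++
code nor the WinAPI implementation; utf16ToUtf8 is a hypothetical name,
and replacing unpaired surrogates with U+FFFD is an assumption made for
this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-16 to UTF-8. A high surrogate
// followed by a low surrogate is combined into one code point above
// U+FFFF; unpaired surrogates become U+FFFD (an assumption of this
// sketch).
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high one
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A converter that handles each 16-bit unit on its own would instead emit
two 3-byte sequences for such a pair, which is not valid UTF-8.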

-- 
You received this bug notification because you are a member of
Dcplusplus-team, which is subscribed to DC++.
https://bugs.launchpad.net/bugs/1649066

Title:
  Invalid UTF-8 data is not always being rejected

Status in AirDC++:
  Fix Released
Status in DC++:
  Fix Committed

Bug description:
  There are various cases where invalid UTF-8 data is being consumed by
  the core:

  1. Text::convert will return the original string in case of errors (Linux only, respective Windows-specific functions will return an empty string in case of errors)
  2. When using "utf-8" encoding in NMDC hubs, the original string will always be returned by conversion functions without validation (generally Linux only since "utf-8" can't be selected from DC++'s GUI)
  3. UTF-8 validation is not performed for strings parsed from XML (specifically file/directory names in filelists)
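
  A check that would catch the cases above can be sketched as follows.
  This is only an illustration of strict UTF-8 validation in the sense
  of RFC 3629, not the function used in DC++ or AirDC++; isValidUtf8 is
  a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong encodings,
// UTF-16 surrogates (U+D800..U+DFFF) and values above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        if (c < 0x80) {  // plain ASCII
            ++i;
            continue;
        }
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;  // continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }

        // Overlong forms: the value would have fit in a shorter sequence.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && cp < 0x10000) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogates
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

  Running such a check on strings returned by conversion functions and
  on names parsed from XML would let invalid data be rejected before it
  reaches the core.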

  This will cause issues especially when the data is processed by
  external sources/libraries that expect to receive valid UTF-8 data
  (https://github.com/airdcpp-web/airdcpp-webclient/issues/204). I'm not
  sure about security implications.

  Another note: messages that fail UTF-8 validation in ADC hubs are
  ignored silently. At least Flexhub seems to be having problems with
  data validation which currently goes unnoticed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/airdcpp/+bug/1649066/+subscriptions

References

[Bug 1649066] [NEW] Invalid UTF-8 data is not always being rejected
From: maksis, 2016-12-11