zorba-coders team mailing list archive
Message #12296
[Bug 1024448] Re: data-converter module problems with non utf-8 characters
I'm not an encoding expert, so anything I say here may be wrong.
The string "\ud83d\udc4a" is an example containing a single character,
U+1F44A, written as two JavaScript \u escapes
(cf. http://www.charbase.com/1f44a-unicode-fisted-hand-sign ).
This is very common in JSON data, as JavaScript engines seem to use
UTF-16 or UCS-2 internally, where such a character is stored as a
surrogate pair.
I believe that the JSON parser decodes "\ud83d" and "\udc4a" as two
separate characters instead of combining them into one. As a result, it
returns a string containing invalid code points. This can be reproduced
with the following query:
import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text()
returns:
dynamic error [err:FOCH0001]: "55357":
invalid code point; raised at runtime\zorba\src\api\serialization\serializer.cpp:204
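As a quick check of the number in that error: 55357 is 0xD83D in decimal, i.e. the first \u escape taken on its own. The Python sketch below is only an illustration (it is not Zorba code, and the helper name classify is made up for this example). It shows that both escaped values fall in the UTF-16 surrogate ranges, so neither is a valid code point by itself:

```python
# Illustration only: "classify" is a hypothetical helper, not part of Zorba.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # first half of a surrogate pair
LOW_SURROGATES = range(0xDC00, 0xE000)   # second half of a surrogate pair

def classify(code_unit):
    # Say whether a 16-bit value from a \uXXXX escape can stand alone.
    if code_unit in HIGH_SURROGATES:
        return "high surrogate"
    if code_unit in LOW_SURROGATES:
        return "low surrogate"
    return "scalar value"

first = int("d83d", 16)
second = int("dc4a", 16)
print(first, classify(first))    # 55357 high surrogate
print(second, classify(second))  # 56394 low surrogate
```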
Would it be possible for the JSON parser to detect UTF-16 surrogate
pairs in \u escapes and convert each pair into the corresponding valid
UTF-8 character?
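A minimal sketch of the conversion being asked for, assuming the parser already has the two 16-bit values from the \u escapes. It is written in Python purely for illustration; the function name combine_surrogates is hypothetical and says nothing about how Zorba's parser is actually structured:

```python
# Illustration only: "combine_surrogates" is a hypothetical helper.
def combine_surrogates(high, low):
    # Merge a UTF-16 high/low surrogate pair into one Unicode code point.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(0xD83D, 0xDC4A)
utf8_bytes = chr(code_point).encode("utf-8")
print(hex(code_point))  # 0x1f44a
print(utf8_bytes)       # b'\xf0\x9f\x91\x8a'
```

A lone surrogate (a high one without a following low one, or the reverse) has no valid UTF-8 form, so a parser would still have to decide whether to report an error or substitute a replacement character in that case.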
--
You received this bug notification because you are a member of Zorba
Coders, which is the registrant for Zorba.
https://bugs.launchpad.net/bugs/1024448
Title:
data-converter module problems with non utf-8 characters
Status in Zorba - The XQuery Processor:
Incomplete
Bug description:
In public JSON streams, lots of character escapes can be found that are
not valid UTF-8 on their own (for example the one described here:
http://www.charbase.com/1f44a-unicode-fisted-hand-sign ). They cause
problems when parsing the JSON or tidying the HTML it contains.
The following example query causes a whole bunch of problems:
import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
let $text := "<p>" || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text() || "</p>"
return html:parse($text)
Problems:
1. html:parse() has return type document-node(), but tries to return
an empty sequence in this example (discovered by ghislain)
* --> moved to bug #1025194 *
2. in file
src/com/zorba-xquery/www/modules/converters/html.xq.src/tidy_wrapper.h
function createHtmlItem(...) doesn't throw a proper error message
(discovered by ghislain), which makes debugging really hard. In
contrast, parse-xml throws a very helpful error:
dynamic error [err:FODC0006]: invalid content passed to
fn:parse-xml(): loader parsing error: Char 0xD83D out of allowed range;
Could html:parse report the same error?
* --> moved to bug #1025193 *
3. json:parse() doesn't report an error here, which is good in my
opinion. Yet, as these UTF-16 (?) encoded characters are used a lot in
JSON, would it be possible to transform them into valid UTF-8 (e.g.
\ud83d\udc4a -> 👊)?
Maybe these findings are going to be a problem in JSONiq as well?
To manage notifications about this bug go to:
https://bugs.launchpad.net/zorba/+bug/1024448/+subscriptions