zorba-coders team mailing list archive

[Bug 1024448] Re: data-converter module problems with non utf-8 characters

To: zorba-coders@xxxxxxxxxxxxxxxxxxx
From: "Paul J. Lucas" <1024448@xxxxxxxxxxxxxxxxxx>
Date: Mon, 16 Jul 2012 20:25:26 -0000
Reply-to: Bug 1024448 <1024448@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

I believe what's going on is that escape sequences like \ud83d\udc4a are
supposed to represent UTF-16 surrogate pairs. This is what Dennis
suggests, since U+1F44A is the Unicode code point the pair represents.
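
To make the arithmetic concrete, here is a minimal Python sketch of how
a high/low surrogate pair maps to a single code point. The function name
combine_surrogates is made up for illustration and is not part of Zorba.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDC4A)
print(f"U+{code_point:X}")                        # U+1F44A
print(chr(code_point).encode("utf-8").hex(" "))   # f0 9f 91 8a
```

The last line shows the four-byte UTF-8 encoding of the combined code
point.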

IMHO, this is a bizarre way to do things: use a UTF-8 byte sequence to
encode UTF-16 surrogate pairs. The code-points represented by the
surrogate pairs should just be encoded in UTF-8 directly.

That said, I believe it's probably possible to handle this bizarre case
and "do the right thing."
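
As a rough illustration of what "the right thing" could look like, the
Python sketch below scans a string for \uXXXX escapes and merges a high
surrogate that is immediately followed by a low surrogate into one code
point. This is not Zorba's parser code: decode_unicode_escapes is a
hypothetical helper, it only handles \u escapes (other JSON escapes are
copied through unchanged), and it passes unpaired surrogates through
rather than raising an error.

```python
import re

# Matches one backslash-u escape with exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Decode backslash-u escapes, merging UTF-16 surrogate pairs."""
    out = []
    pos = 0
    while pos < len(text):
        match = _ESCAPE.match(text, pos)
        if match is None:
            if text[pos] == "\\":
                # Some other escape (e.g. an escaped backslash): copy both
                # characters so the second one is not re-examined.
                out.append(text[pos:pos + 2])
                pos += 2
            else:
                out.append(text[pos])
                pos += 1
            continue
        unit = int(match.group(1), 16)
        pos = match.end()
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: look for a low surrogate escape right after it.
            nxt = _ESCAPE.match(text, pos)
            if nxt is not None:
                low = int(nxt.group(1), 16)
                if 0xDC00 <= low <= 0xDFFF:
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    pos = nxt.end()
        out.append(chr(unit))
    return "".join(out)


decoded = decode_unicode_escapes(r"Let's get it. \ud83d\udc4a")
print(f"U+{ord(decoded[-1]):X}")  # U+1F44A
```

A production parser would probably want to report an unpaired surrogate
as an error instead of passing it through.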

** Summary changed:

- data-converter module problems with non utf-8 characters
+ JSON parser doesn't recognize UTF-16 surrogate pairs

** Description changed:

- In public Json streams lots of non-utf8 character escapes can be found
- causing some problems when parsing json or tidying the contained html
- (as for example marketed here:
- http://www.charbase.com/1f44a-unicode-fisted-hand-sign).
- 
- The following example Query causes a whole bunch of problems:
- 
-   import module namespace json = "http://www.zorba-xquery.com/modules/converters/json";
-   import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
-   declare namespace j = "http://john.snelson.org.uk/parsing-json-into-xquery";
-   let $text := "&lt;p>" || json:parse("{""text"":""Let's get it. \ud83d\udc4a""}")/j:pair[@name="text"]/text() || "&lt;/p>"
-   return html:parse($text)
- 
- Problems:
- 
- 1. html:parse() has return type document-node(), but tries to return an
- empty sequence in this example (discovered by ghislain)
- 
- * --> moved to bug #1025194 *
- 
- 2. in file
- src/com/zorba-xquery/www/modules/converters/html.xq.src/tidy_wrapper.h
- function createHtmlItem(...) doesn't throw a proper error message
- (discovered by ghislain) which makes debugging really hard. In contrast,
- parse-xml throws a very helpful error:
- 
-   dynamic error [err:FODC0006]: invalid content passed to
-   fn:parse-xml(): loader parsing error: Char 0xD83D out of allowed range;
- 
- Could html:parse report the same error?
- 
- * --> moved to bug #1025193 *
- 
- 3. json:parse() doesn't report an error here which is good in my
- opinion. Yet, as these utf-16 (?) encoded characters are used a lot in
- json, would it be possible to transform them into valid utf-8 (e.g.
- \ud83d\udc4a -> &#x1f44a;)?
- 
- Maybe these findings are going to be a problem in Jsoniq as well?
+ The JSON parser doesn't recognize UTF-16 surrogate pairs, e.g., the
+ escape sequence "\ud83d\udc4a" is currently converted to two separate
+ Unicode code points when the parser ought to recognize it as a UTF-16
+ surrogate pair and produce the single code point U+1F44A.

-- 
You received this bug notification because you are a member of Zorba
Coders, which is the registrant for Zorba.
https://bugs.launchpad.net/bugs/1024448

Title:
  JSON parser doesn't recognize UTF-16 surrogate pairs

Status in Zorba - The XQuery Processor:
  Incomplete

Bug description:
  The JSON parser doesn't recognize UTF-16 surrogate pairs, e.g., the
  escape sequence "\ud83d\udc4a" is currently converted to two separate
  Unicode code points when the parser ought to recognize it as a UTF-16
  surrogate pair and produce the single code point U+1F44A.

To manage notifications about this bug go to:
https://bugs.launchpad.net/zorba/+bug/1024448/+subscriptions

References

[Bug 1024448] [NEW] data-converter module problems with non utf-8 characters
From: Dennis Knochenwefel, 2012-07-13