launchpad-dev team mailing list archive

Re: Unicode and Launchpad

To: Julian Edwards <julian.edwards@xxxxxxxxxxxxx>
From: Jeroen Vermeulen <jtv@xxxxxxxxxxxxx>
Date: Thu, 27 Oct 2011 17:54:29 +0700
Cc: launchpad-dev@xxxxxxxxxxxxxxxxxxx
In-reply-to: <3319193.WigTXoQCXT@beast>
User-agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1

On 2011-10-27 16:32, Julian Edwards wrote:

The things we discussed on the call were fairly simple:

  * Keep all strings as unicode internally (with the exception of plain ASCII
    strings which are easily coerced to unicode automatically)
  * Convert to/from unicode only when necessary (e.g. utf8 byte string or MIME)
    at the point the string *exits or enters* Launchpad.
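
A minimal sketch of that boundary rule, in Python 3 terms where `bytes` plays the role of the raw str and `str` the role of unicode. The function names `enter_launchpad` and `exit_launchpad` are made up for illustration and are not existing Launchpad code; UTF-8 as the default is also an assumption.

```python
def enter_launchpad(raw, encoding="utf-8"):
    """Decode bytes into text at the point where data enters (hypothetical helper)."""
    return raw.decode(encoding)


def exit_launchpad(text, encoding="utf-8"):
    """Encode text into bytes at the point where data leaves (hypothetical helper)."""
    return text.encode(encoding)


# Everything between the two calls works on text only.
incoming = b"caf\xc3\xa9"            # UTF-8 bytes from outside
name = enter_launchpad(incoming)      # text inside
shouted = name.upper()                # internal processing on text
outgoing = exit_launchpad(shouted)    # bytes again on the way out
```

The point of the shape is that no code between the two calls ever has to ask which encoding a value is in.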

This is something I've been longing for, but it never became enough of a priority.

Converting to unicode raises three questions that I suspect we've skirted in many places:

1. Are we relying on conversion errors as protection against non-ASCII input that we don't know how to handle?


2. What encoding?

3. So it fails.  Now what?

We should start by drawing a clear line between "raw str" and "decoded unicode" in our existing code. Where this becomes difficult, we can break it down into steps that can be treated as separate bugs:


  i) Decode as ascii.  No change, except it's explicit.
 ii) Fail sensibly when data is non-decodable.
iii) If the data's use is unicode-safe, decode as utf-8.
 iv) Worry about format-specific encoding directives.
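
Steps i) through iii) could look roughly like the sketch below. It is written for Python 3, so `bytes` stands in for the raw str and `str` for unicode. The names `UndecodableInput`, `decode_input` and the `unicode_safe` flag are hypothetical, not existing Launchpad code, and step iv) is left out because it depends on the format in question.

```python
class UndecodableInput(ValueError):
    """Hypothetical error for input that cannot be decoded (step ii)."""


def decode_input(raw, unicode_safe=False):
    """Decode raw bytes into text with an explicit codec.

    Step i: decode as ascii by default, but say so explicitly.
    Step iii: callers whose use of the data is unicode-safe opt in to utf-8.
    """
    encoding = "utf-8" if unicode_safe else "ascii"
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        # Step ii: fail with one well-defined error the caller can handle.
        raise UndecodableInput(
            "cannot decode input as %s at byte offset %d"
            % (encoding, error.start)
        ) from error
```

Each call site can then be moved from the default to `unicode_safe=True` on its own, which fits treating the steps as separate bugs.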

That's assuming that it's generally safe to decode as utf-8, i.e. that non-utf-8 text will probably give you a proper exception rather than bad data, and will definitely give you a proper exception rather than dangerous data.
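
A small Python 3 illustration of the "probably" in that assumption, using Latin-1 as an example of non-utf-8 text (the choice of Latin-1 and the sample strings are mine). Typical Latin-1 input is rejected by a strict utf-8 decode, but some Latin-1 byte sequences happen to be valid utf-8 and decode silently to different text.

```python
# Latin-1 bytes for "café": the lone 0xE9 byte is not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"

# Latin-1 bytes for "Ã©" (U+00C3, U+00A9) are also the UTF-8 encoding of "é",
# so the decode succeeds and yields text the sender never wrote.
ambiguous_bytes = "\u00c3\u00a9".encode("latin-1")
silently_decoded = ambiguous_bytes.decode("utf-8")
```

The first case is the proper exception; the second is the bad-data case, which a strict decode alone cannot detect.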



Jeroen

References

Unicode and Launchpad
From: Julian Edwards, 2011-10-27