launchpad-dev team mailing list archive
Message #08220
Re: Unicode and Launchpad
On 2011-10-27 16:32, Julian Edwards wrote:
> The things we discussed on the call were fairly simple:
> * Keep all strings as unicode internally (with the exception of plain ASCII
> strings which are easily coerced to unicode automatically)
> * Convert to/from unicode only when necessary (e.g. utf8 byte string or MIME)
> at the point the string *exits or enters* Launchpad.
This is something I've been longing for, but it never became enough of a
priority.
Converting to unicode raises three questions that I suspect we've
skirted in many places:
1. Are we relying on conversion errors as protection against non-ASCII
input that we don't know how to handle?
2. What encoding?
3. So it fails. Now what?
We should start by drawing a clear line between "raw str" and "decoded
unicode" in our existing code. Where this becomes difficult, we can
break it down into steps that can be treated as separate bugs:
i) Decode as ascii. No change, except it's explicit.
ii) Fail sensibly when data is non-decodable.
iii) If the data's use is unicode-safe, decode as utf-8.
iv) Worry about format-specific encoding directives.
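A rough sketch of steps i) to iii), written for Python 3, where bytes and str play the roles of "raw str" and "decoded unicode". The names decode_at_boundary and UndecodableInput are made up for illustration and are not existing Launchpad code:

```python
class UndecodableInput(ValueError):
    """Hypothetical error for input that cannot be decoded at the boundary."""


def decode_at_boundary(raw, encoding="ascii"):
    """Turn bytes entering the application into text, or fail clearly.

    Step i):   the default encoding is ascii, but now it is spelled out.
    Step ii):  a decoding failure becomes one well-defined exception.
    Step iii): callers whose data is unicode-safe pass encoding="utf-8".
    """
    if isinstance(raw, str):
        # Already decoded text; nothing to do.
        return raw
    try:
        # bytes.decode() is strict by default: bad input raises, it is
        # never silently replaced or dropped.
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        raise UndecodableInput(
            "cannot decode input as %s at byte offset %d"
            % (encoding, error.start)) from error
```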
That's assuming that it's generally safe to decode as utf-8, i.e. that
non-utf-8 text will probably give you a proper exception rather than bad
data — and definitely give you a proper exception rather than dangerous
data.
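To make the "probably" concrete, here is a small check, again assuming Python 3 and using a made-up helper name. Strict utf-8 decoding rejects typical Latin-1 bytes and overlong sequences, but some non-utf-8 text happens to form valid utf-8 and decodes to the wrong characters without any error:

```python
def decodes_as_utf8(raw):
    """Return True if raw (bytes) is accepted by strict utf-8 decoding."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with e-acute, encoded as Latin-1: the lone byte 0xE9 is not
# valid utf-8, so this is rejected with a proper exception.
latin1_cafe = "caf\u00e9".encode("latin-1")

# A-tilde followed by a copyright sign, encoded as Latin-1, gives the
# bytes C3 A9, which are also the valid utf-8 encoding of e-acute.
# This decodes without error, but to the wrong text: bad data.
latin1_lookalike = "\u00c3\u00a9".encode("latin-1")

# An overlong encoding of "/" (C0 AF), the classic "dangerous data"
# case, is rejected by Python's strict utf-8 decoder.
overlong_slash = b"\xc0\xaf"
```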
Jeroen