launchpad-dev team mailing list archive

Re: Unicode and Launchpad

 

On 2011-10-27 16:32, Julian Edwards wrote:

> The things we discussed on the call were fairly simple:
>
>   * Keep all strings as unicode internally (with the exception of plain ASCII
>     strings which are easily coerced to unicode automatically)
>   * Convert to/from unicode only when necessary (e.g. utf8 byte string or MIME)
>     at the point the string *exits or enters* Launchpad.

This is something I've been longing for, but it never became enough of a
priority.

Converting to unicode raises three questions that I suspect we've
skirted in many places:

1. Are we relying on conversion errors as protection against non-ASCII
   input that we don't know how to handle?
2. What encoding?
3. So it fails.  Now what?

We should start by drawing a clear line between "raw str" and "decoded unicode" in our existing code. Where this becomes difficult, we can break it down into steps that can be treated as separate bugs:
  i) Decode as ascii.  No change, except it's explicit.
 ii) Fail sensibly when data is non-decodable.
iii) If the data's use is unicode-safe, decode as utf-8.
 iv) Worry about format-specific encoding directives.
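
A minimal sketch of steps i) to iii), to make the progression concrete. It is written in Python 3 terms (bytes/str rather than str/unicode), and the names decode_input, encode_output and UndecodableInput are hypothetical, not existing Launchpad code:

```python
class UndecodableInput(ValueError):
    """Hypothetical error for bytes that cannot be decoded at the boundary."""


def decode_input(raw, encoding="ascii"):
    """Decode bytes entering the application, or fail with a clear error.

    Step i):   the default encoding is ascii, but now it is explicit.
    Step ii):  undecodable data raises one well-defined exception type.
    Step iii): callers whose data is unicode-safe pass encoding="utf-8".
    """
    if isinstance(raw, str):
        # Already decoded text; nothing to do.
        return raw
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as error:
        raise UndecodableInput(
            "cannot decode input as %s at byte offset %d"
            % (encoding, error.start)
        ) from error


def encode_output(text, encoding="utf-8"):
    """Encode text at the point where it leaves the application."""
    return text.encode(encoding)
```

Each call site can then move from the default to encoding="utf-8" on its own schedule, which is what lets the steps be tracked as separate bugs.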

That's assuming that it's generally safe to decode as utf-8, i.e. that non-utf-8 text will probably give you a proper exception rather than bad data, and will definitely give you a proper exception rather than dangerous data.
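
A small illustration of that assumption, using Python 3's strict utf-8 decoder (the helper name is_valid_utf8 is made up for this example). It shows why the first half is only "probably": some byte sequences written in another encoding also happen to be valid utf-8.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode under strict utf-8, False otherwise."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Latin-1 encoded "cafe" with an acute accent: the lone 0xe9 byte is not
# valid utf-8, so decoding gives a proper exception.
latin1_text = b"caf\xe9"

# An overlong two-byte encoding of "/": the strict decoder rejects it
# rather than handing back a slash that slipped past earlier checks.
overlong_slash = b"\xc0\xaf"

# The "bad data" case: these two bytes are two characters in Latin-1,
# but they are also a valid utf-8 encoding of a single character, so
# decoding succeeds and silently gives the wrong text if the sender
# really meant Latin-1.
ambiguous = b"\xc3\xa9"
```

Here is_valid_utf8(latin1_text) and is_valid_utf8(overlong_slash) are both False, while is_valid_utf8(ambiguous) is True.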

Jeroen

