launchpad-dev team mailing list archive

Re: Unicode and Launchpad

On 2011-10-27 16:32, Julian Edwards wrote:

> The things we discussed on the call were fairly simple:
>
>   * Keep all strings as unicode internally (with the exception of plain
>     ASCII strings which are easily coerced to unicode automatically)
>   * Convert to/from unicode only when necessary (e.g. utf8 byte string or
>     MIME) at the point the string *exits or enters* Launchpad.

This is something I've been longing for, but it never became enough of a priority.

Converting to unicode raises three questions that I suspect we've skirted in many places:

1. Are we relying on conversion errors as protection against non-ASCII input that we don't know how to handle?

2. What encoding?

3. So the conversion fails.  Now what?

We should start by drawing a clear line between "raw str" and "decoded unicode" in our existing code. Where this becomes difficult, we can break it down into steps that can be treated as separate bugs:

  i) Decode as ascii.  No change, except it's explicit.
 ii) Fail sensibly when data is non-decodable.
iii) If the data's use is unicode-safe, decode as utf-8.
 iv) Worry about format-specific encoding directives.

That's assuming that it's generally safe to decode as utf-8, i.e. that non-utf-8 text will probably give you a proper exception rather than bad data, and will definitely give you a proper exception rather than dangerous data.
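A small illustration of that assumption, again in Python 3 and with a made-up helper name (`try_utf8`). Strict utf-8 decoding rejects most text in other 8-bit encodings, but not all of it, which is why the guarantee against bad data is only "probably":

```python
def try_utf8(raw):
    """Return the decoded text, or None if raw is not valid utf-8."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# "cafe" with an e-acute, encoded as Latin-1: the lone 0xE9 byte is not
# valid utf-8, so decoding fails cleanly.
latin1_cafe = "caf\u00e9".encode("latin-1")   # b"caf\xe9"
print(try_utf8(latin1_cafe))

# The two Latin-1 characters U+00C3 U+00A9 encode to b"\xc3\xa9", which
# also happens to be valid utf-8 for a single e-acute.  No exception is
# raised; the result is simply different text from what was meant.
ambiguous = "\u00c3\u00a9".encode("latin-1")  # b"\xc3\xa9"
print(try_utf8(ambiguous))
```

The first case is the common one for real-world Latin-1 text; the second shows the kind of silent misreading that remains possible.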


Jeroen
