launchpad-reviewers team mailing list archive

Thread
Date

[Merge] lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops into lp:launchpad/devel

To: mp+33428@xxxxxxxxxxxxxxxxxx
From: Edwin Grubbs <edwin.grubbs@xxxxxxxxxxxxx>
Date: Mon, 23 Aug 2010 17:57:45 -0000
Reply-to: mp+33428@xxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx

Edwin Grubbs has proposed merging lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops into lp:launchpad/devel.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  #615655 UnicodeDecodeError in xmlrpc holdMessage
  https://bugs.launchpad.net/bugs/615655


Summary
-------

This branch fixes an oops caused by nonascii characters in an email
preventing a str from being converted to a unicode object. Normally,
this means the message is spam, but since we are not absolutely certain
that will be the case, we will just escape the offending characters and
let the mailing list manager review the email in Launchpad.

Tests
-----

./bin/test -vv -t canonical.encoding -t message-holds-xmlrpc.txt
-- 
https://code.launchpad.net/~edwin-grubbs/launchpad/bug-615655-unicode-oops/+merge/33428
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~edwin-grubbs/launchpad/bug-615655-unicode-oops into lp:launchpad/devel.

=== modified file 'lib/canonical/encoding.py'
--- lib/canonical/encoding.py	2009-06-25 05:30:52 +0000
+++ lib/canonical/encoding.py	2010-08-23 17:57:44 +0000
@@ -7,7 +7,6 @@
 import re
 import codecs
 import unicodedata
-from htmlentitydefs import codepoint2name
 from cStringIO import StringIO
 
 __all__ = ['guess', 'ascii_smash']
@@ -151,33 +150,6 @@
     return unicode(s, 'ISO-8859-1', 'replace')
 
 
-# def unicode_to_unaccented_str(text):
-#     """Converts a unicode string into an ascii-only str, converting accented
-#     characters to their plain equivalents.
-#
-#     >>> unicode_to_unaccented_str(u'')
-#     ''
-#     >>> unicode_to_unaccented_str(u'foo bar 123')
-#     'foo bar 123'
-#     >>> unicode_to_unaccented_str(u'viva S\xe3o Carlos!')
-#     'viva Sao Carlos!'
-#     """
-#     assert isinstance(text, unicode)
-#     L = []
-#     for char in text:
-#         charnum = ord(char)
-#         codepoint = codepoint2name.get(charnum)
-#         if codepoint is not None:
-#             strchar = codepoint[0]
-#         else:
-#             try:
-#                 strchar = char.encode('ascii')
-#             except UnicodeEncodeError:
-#                 strchar = ''
-#         L.append(strchar)
-#     return ''.join(L)
-
-
 def ascii_smash(unicode_string):
     """Attempt to convert the Unicode string, possibly containing accents,
     to an ASCII string.
@@ -370,6 +342,44 @@
     if match is not None:
         return match.group(1)
 
-    # Something we can"t represent. Return empty string.
+    # Something we can't represent. Return empty string.
     return ""
 
+
+def escape_nonascii_uniquely(bogus_string):
+    """Replace non-ascii characters with a hex representation.
+
+    This is mainly for preventing emails with invalid characters from causing
+    oopses. The nonascii characters could have been removed or just converted
+    to "?", but this provides some insight into what the bogus data was, and
+    it prevents the message-id from two unrelated emails matching because
+    all the nonascii characters have been replaced with the same ascii
+    character.
+
+    Unfortunately, all the strings below are actually part of this
+    function's docstring, so python processes the backslash once before
+    doctest, and then python processes it again when doctest runs the
+    test. This makes it confusing, since four backslashes will get
+    converted into a single ascii character.
+
+    >>> print len('\xa9'), len('\\xa9'), len('\\\\xa9')
+    1 1 4
+    >>> print escape_nonascii_uniquely('hello \xa9')
+    hello \\xa9
+    >>> print escape_nonascii_uniquely('hello \\xa9')
+    hello \\xa9
+
+    This string only has ascii characters, so escape_nonascii_uniquely()
+    actually has no effect.
+
+    >>> print escape_nonascii_uniquely('hello \\\\xa9')
+    hello \\xa9
+    """
+    nonascii_regex = re.compile(r'[\200-\377]')
+    # By encoding the invalid ascii with a backslash, x, and then the
+    # hex value, it makes it easy to decode it by pasting into a python
+    # interpreter. quopri() is not used, since that could caused the
+    # decoding of an email to fail.
+    def quote(match):
+        return '\\x%x' % ord(match.group(0))
+    return nonascii_regex.sub(quote, bogus_string)

=== modified file 'lib/canonical/launchpad/xmlrpc/mailinglist.py'
--- lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-20 20:31:18 +0000
+++ lib/canonical/launchpad/xmlrpc/mailinglist.py	2010-08-23 17:57:44 +0000
@@ -8,7 +8,6 @@
     'MailingListAPIView',
     ]
 
-
 import xmlrpclib
 
 from zope.component import getUtility
@@ -16,6 +15,7 @@
 from zope.security.proxy import removeSecurityProxy
 
 from canonical.config import config
+from canonical.encoding import escape_nonascii_uniquely
 from canonical.launchpad.interfaces import (
     EmailAddressStatus,
     IEmailAddressSet,
@@ -240,6 +240,10 @@
         # though it's much more convenient to just pass 8-bit strings.
         if isinstance(bytes, xmlrpclib.Binary):
             bytes = bytes.data
+        # Although it is illegal for an email to have unencoded non-ascii
+        # characters, it is better to let the list owner process the
+        # message than to cause an oops.
+        bytes = escape_nonascii_uniquely(bytes)
         mailing_list = getUtility(IMailingListSet).get(team_name)
         message = getUtility(IMessageSet).fromEmail(bytes)
         mailing_list.holdMessage(message)

=== modified file 'lib/lp/registry/doc/message-holds-xmlrpc.txt'
--- lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-07-13 20:15:26 +0000
+++ lib/lp/registry/doc/message-holds-xmlrpc.txt	2010-08-23 17:57:44 +0000
@@ -226,18 +226,21 @@
 Non-ascii messages
 ==================
 
-Messages with non-ascii in their headers or bodies are not exactly legal (they
-should be encoded) but do occur especially in spam.  These messages can be
-held for moderator approval too.
+Messages with non-ascii in their headers or bodies are not exactly legal
+(they should be encoded) but do occur especially in spam.  These
+messages can be held for moderator approval too. To avoid blowing up
+later if the string is converted to a unicode object, the non-ascii
+characters are replaced.
 
     >>> spam_message = message_from_string("""\
     ... From: Anne \xa9 Person <anne.person@xxxxxxxxxxx>
     ... To: team-one@xxxxxxxxxxxxxxxxxxx
     ... Subject: \xa9 Badgers!
-    ... Message-ID: <fifth-post>
+    ... Message-ID: <fifth-post\xa9>
     ... Date: Fri, 01 Aug 2000 01:08:59 -0000
     ...
     ... Watch out for badgers! \xa9
+    ... Don't double quote characters: =E3=F6=FC
     ... """)
 
     >>> import xmlrpclib
@@ -247,9 +250,10 @@
     True
     >>> commit()
 
-    >>> held_message_spam = message_set.getMessageByMessageID('<fifth-post>')
+    >>> held_message_spam = message_set.getMessageByMessageID(
+    ...     '<fifth-post\\xa9>')
     >>> print held_message_spam.message_id
-    <fifth-post>
+    <fifth-post\xa9>
     >>> print held_message_spam.posted_by.displayname
     Anne Person
 
@@ -258,14 +262,15 @@
     ...     message_content = held_message_spam.posted_message.read()
     ... finally:
     ...     held_message_spam.posted_message.close()
-    >>> message_content.splitlines()
-    ['From: Anne \xa9 Person <anne.person@xxxxxxxxxxx>',
+    >>> print pretty(message_content.splitlines())
+    ['From: Anne \\xa9 Person <anne.person@xxxxxxxxxxx>',
      'To: team-one@xxxxxxxxxxxxxxxxxxx',
-     'Subject: \xa9 Badgers!',
-     'Message-ID: <fifth-post>',
+     'Subject: \\xa9 Badgers!',
+     'Message-ID: <fifth-post\\xa9>',
      'Date: Fri, 01 Aug 2000 01:08:59 -0000',
      '',
-     'Watch out for badgers! \xa9']
+     'Watch out for badgers! \\xa9',
+     "Don't double quote characters: =E3=F6=FC"]
 
     >>> held_message_spam.status
      <DBItem PostedMessageStatus.NEW, (0) New status>