launchpad-reviewers team mailing list archive

Thread
Date

[Merge] lp:~jcsackett/launchpad/linkifier-bugs-2 into lp:launchpad

To: mp+43980@xxxxxxxxxxxxxxxxxx
From: "j.c.sackett" <jonathan.sackett@xxxxxxxxxxxxx>
Date: Thu, 16 Dec 2010 21:32:06 -0000
Reply-to: mp+43980@xxxxxxxxxxxxxxxxxx
Sender: bounces@xxxxxxxxxxxxx

j.c.sackett has proposed merging lp:~jcsackett/launchpad/linkifier-bugs-2 into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  #60172 Do not auto-link "<protocol>://" by itself
  https://bugs.launchpad.net/bugs/60172
  #118284 URLs ending with a ) aren't linkified properly
  https://bugs.launchpad.net/bugs/118284


Summary
=======

As part of bugjam2010, a series of bugs having to do with the linkifier have been targeted. This branch deals with standard URI links that need some postprocessing after the regex capture.

URIs are identified in stringformatter via a serious regex, but to do certain things (like reject urls that consist of just a protocol with no domain) you would have to modify the regex to catch invalid urls, which would make it even harder to deal with. Instead it's easier and cleaner to just make decisions about when to linkify based on the output of the regex.

Implementation
==============

Two methods have been added to the FormattersAPI in lib/lp/app/browser/stringformatter.py

The first, _linkify_url_is_blacklisted, adds capacity to reject URIs for linkification if they are in a predefined list. The list is all of the protocols defined in the regex as standalone URIs (e.g. http://, sftp://). This is used in the linkify process to return the URL as text rather than as a link if it's only the protocol.

The second _handle_parens_in_trailers deals with closing parenthesis following the url that should be considered part of the url--e.g. the closing parenthesis is http://wikipedia.com/Fell_(comic). This is used in the linkification process to move the closing parenthesis out of the trailers and back into the URI.

Tests
=====

bin/test -t TestLinkifyingProtocols

QA
==

Create a bug comment on launchpad.dev with the following text:

"http:// isn't a link. http://example.com/something_(with_parens) should be fully linked."

The first http:// shouldn't link. The second URI should link, including the closing paren at the end.

Lint
====


= Launchpad lint =

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/lp/app/browser/stringformatter.py
  lib/lp/app/browser/tests/test_stringformatter.py

./lib/lp/app/browser/stringformatter.py
     450: E501 line too long (90 characters)
     454: E501 line too long (88 characters)
     450: Line exceeds 78 characters.
     454: Line exceeds 78 characters.

The two lines that are too long are part of the regex, which is difficult enough to read without breaking up those sections.
-- 
https://code.launchpad.net/~jcsackett/launchpad/linkifier-bugs-2/+merge/43980
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~jcsackett/launchpad/linkifier-bugs-2 into lp:launchpad.

=== modified file 'lib/lp/app/browser/stringformatter.py'
--- lib/lp/app/browser/stringformatter.py	2010-12-14 17:33:05 +0000
+++ lib/lp/app/browser/stringformatter.py	2010-12-16 21:31:33 +0000
@@ -201,6 +201,22 @@
         return '&nbsp;' * len(groups[0])
 
     @staticmethod
+    def _handle_parens_in_trailers(url, trailers):
+        opencount = url.count('(')
+        closedcount = url.count(')')
+        missing = opencount - closedcount
+        slice_idx = 0
+        while slice_idx < missing:
+            if trailers[slice_idx] == ')':
+                slice_idx += 1
+            else:
+                break
+        url += trailers[:slice_idx]
+        trailers = trailers[slice_idx:]
+        return url, trailers
+        
+
+    @staticmethod
     def _split_url_and_trailers(url):
         """Given a URL return a tuple of the URL and punctuation trailers.
 
@@ -216,7 +232,7 @@
             url = url[:-len(trailers)]
         else:
             trailers = ''
-        return url, trailers
+        return FormattersAPI._handle_parens_in_trailers(url, trailers)
 
     @staticmethod
     def _linkify_bug_number(text, bugnum, trailers=''):
@@ -227,6 +243,30 @@
         return '<a href="%s">%s</a>%s' % (url, text, trailers)
 
     @staticmethod
+    def _linkify_url_is_blacklisted(url):
+        '''Don't linkify URIs consisting of just the protocol'''
+        
+        blacklist_bases = [
+            'about',
+            'gopher',
+            'http',
+            'https',
+            'sftp',
+            'news',
+            'ftp',
+            'mailto',
+            'irc',
+            'jabber',
+            'apt',
+            'data',
+            ]
+
+        for base in blacklist_bases:
+            if url in ('%s' % base, '%s:' % base, '%s://' % base):
+                return True
+        return False
+
+    @staticmethod
     def _linkify_substitution(match):
         if match.group('bug') is not None:
             return FormattersAPI._linkify_bug_number(
@@ -235,16 +275,19 @@
             # The text will already have been cgi escaped.  We temporarily
             # unescape it so that we can strip common trailing characters
             # that aren't part of the URL.
-            url = match.group('url')
-            url, trailers = FormattersAPI._split_url_and_trailers(url)
+            full_url = match.group('url')
+            url, trailers = FormattersAPI._split_url_and_trailers(full_url)
             # We use nofollow for these links to reduce the value of
             # adding spam URLs to our comments; it's a way of moderately
             # devaluing the return on effort for spammers that consider
             # using Launchpad.
-            return '<a rel="nofollow" href="%s">%s</a>%s' % (
-                cgi.escape(url, quote=True),
-                add_word_breaks(cgi.escape(url)),
-                cgi.escape(trailers))
+            if not FormattersAPI._linkify_url_is_blacklisted(url):
+                return '<a rel="nofollow" href="%s">%s</a>%s' % (
+                    cgi.escape(url, quote=True),
+                    add_word_breaks(cgi.escape(url)),
+                    cgi.escape(trailers))
+            else:
+                return full_url
         elif match.group('faq') is not None:
             # This is *BAD*.  We shouldn't be doing database lookups to
             # linkify text.

=== modified file 'lib/lp/app/browser/tests/test_stringformatter.py'
--- lib/lp/app/browser/tests/test_stringformatter.py	2010-12-14 19:56:40 +0000
+++ lib/lp/app/browser/tests/test_stringformatter.py	2010-12-16 21:31:33 +0000
@@ -109,8 +109,66 @@
 
 class TestLinkifyingProtocols(TestCase):
 
+    def test_normal_set(self):
+        test_strings = [
+            "http://example.com";,
+            "http://example.com/";,
+            "http://example.com/path";,
+            "http://example.com/path/";,
+            ]
+        
+        expected_strings = [
+            ('<p><a rel="nofollow" href="http://example.com";>'
+             'http://<wbr></wbr>example.<wbr></wbr>com</a></p>'),
+            ('<p><a rel="nofollow" href="http://example.com/";>'
+             'http://<wbr></wbr>example.<wbr></wbr>com/</a></p>'),
+            ('<p><a rel="nofollow" href="http://example.com/path";>'
+             'http://<wbr></wbr>example.<wbr></wbr>com/path</a></p>'),
+            ('<p><a rel="nofollow" href="http://example.com/path/";>'
+             'http://<wbr></wbr>example.<wbr></wbr>com/path/</a></p>'),
+            ]
+
+        self.assertEqual(
+            expected_strings,
+            [FormattersAPI(text).text_to_html() for text in test_strings])
+    
+    def test_parens_handled_well(self):
+        test_strings = [
+            '(http://example.com)',
+            'http://example.com/path_(with_parens)',
+            '(http://example.com/path_(with_parens))',
+            ]
+
+        expected_html = [
+            ('<p>(<a rel="nofollow" href="http://example.com";>'
+             'http://<wbr></wbr>example.<wbr></wbr>com</a>)</p>'),
+            ('<p><a rel="nofollow" href="http://example.com/path_(with_parens)">'
+             'http://<wbr></wbr>example.<wbr></wbr>com/path_'
+             '<wbr></wbr>(with_parens)</a></p>'),
+            ('<p>(<a rel="nofollow" '
+             'href="http://example.com/path_(with_parens)">'
+             'http://<wbr></wbr>example.<wbr></wbr>com/path_'
+             '<wbr></wbr>(with_parens)</a>)</p>'),
+            ]
+        
+        self.assertEqual(
+            expected_html,
+            [FormattersAPI(text).text_to_html() for text in test_strings])
+
+
+    def test_protocol_alone_does_not_link(self):
+        test_string = "This doesn't link: apt:"
+        html = FormattersAPI(test_string).text_to_html()
+        expected_html = "<p>This doesn't link: apt:</p>"
+        self.assertEqual(expected_html, html)
+
+        test_string = "This doesn't link: http://";
+        html = FormattersAPI(test_string).text_to_html()
+        expected_html = "<p>This doesn't link: http://</p>"
+        self.assertEqual(expected_html, html)
+
     def test_apt_is_linked(self):
-        test_string = 'This becomes a link: apt:some-package'
+        test_strings = 'This becomes a link: apt:some-package'
         html = FormattersAPI(test_string).text_to_html()
         expected_html = (
             '<p>This becomes a link: '