launchpad-reviewers team mailing list archive

Thread
Date

[Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master

To: mp+481785@xxxxxxxxxxxxxxxxxx
From: İlker Emre Koç <mp+481785@xxxxxxxxxxxxxxxxxx>
Date: Mon, 24 Feb 2025 11:50:30 -0000
Reply-to: mp+481785@xxxxxxxxxxxxxxxxxx
Sender: noreply@xxxxxxxxxxxxx

İlker Emre Koç has proposed merging ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master.

Commit message:
Add CRLF acceptance into FileBugDataParser

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~ilkeremrekoc/launchpad/+git/launchpad/+merge/481785

FileBugDataParser is the only parser that works on ApportJobs i.e. jobs from Apport package. Originally, only the LF line-break was expected from Apport, and a hard-coded logic was put on it within FileBugDataParser. The problem is, this was in violation of the RFC standards for HTTP request formats, which only accepts CRLF as its linebreaks. Because of this, an upgrade to multipart package which parsed multipart/form-data requests broke the apport reporting system by becoming more strict in its parsing by refusing to parse LF line-endings.
    
Long story short, to upgrade multipart we need to patch apport package to only send CRLF line-breaks, but to patch apport package we need to make FileBugDataParser accept CRLF line-breaks in the first place.
    
This commit is intended to make FileBugDataParser accept CRLF while retaining LFs for backward compatibility.
    
Some of the things I needed to do to ensure this is in need of discussion though. Mainly, the way I implemented the "reset" to the input blob stream is less-than-perfect as I only had to do that since the Librarian-File-Alias class which FileBugDataParser uses doesn't have a "seek" functionality, requiring a "close()" -> "open()" call within the method instead.
    
Another discussion point could be the inheritance of BytesIO as MockBytesIO inside one of the testing modules, mainly because BytesIO doesn't have any "open()" method but has a "close()" method that act differently than the ones in file-alias, requiring a simple implementation of a child class for BytesIO that implemented an "open()" method.
    
I also added a new test-file into the codebase to ensure we have a message file that uses CRLF for its line-breaks so that we can test the ApportJobs better.

-- 
Your team Launchpad code reviewers is requested to review the proposed merge of ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master.

diff --git a/lib/lp/bugs/tests/test_apportjob.py b/lib/lp/bugs/tests/test_apportjob.py
index 0ba8c7d..33385f8 100644
--- a/lib/lp/bugs/tests/test_apportjob.py
+++ b/lib/lp/bugs/tests/test_apportjob.py
@@ -333,6 +333,29 @@ class ProcessApportBlobJobTestCase(TestCaseWithFactory):
         self._assertFileBugDataMatchesDict(filebug_data, processed_data)
 
 
+class ProcessApportBlobJobWithCRLFTestCase(ProcessApportBlobJobTestCase):
+    """
+    Test case for the ProcessApportBlobJob class when the input blob is using
+    CRLF as its line-break.
+
+    It is especially important to test this here as the CRLF line-breaks break
+    the FileBugDataParser which is only run inside the ProcessApportBlobJob's
+    "run()" method.
+    """
+
+    layer = LaunchpadZopelessLayer
+
+    def setUp(self):
+        super().setUp()
+
+        # Create a BLOB using existing testing data.
+
+        self.blob = self.factory.makeBlob(
+            blob_file="extra_filebug_data_with_crlf.msg"
+        )
+        transaction.commit()  # We need the blob available from the Librarian.
+
+
 class TestViaCelery(TestCaseWithFactory):
     layer = CeleryJobLayer
 
diff --git a/lib/lp/bugs/tests/testfiles/extra_filebug_data_with_crlf.msg b/lib/lp/bugs/tests/testfiles/extra_filebug_data_with_crlf.msg
new file mode 100644
index 0000000..e25b184
--- /dev/null
+++ b/lib/lp/bugs/tests/testfiles/extra_filebug_data_with_crlf.msg
@@ -0,0 +1,37 @@
+MIME-Version: 1.0 
+Content-type: multipart/mixed; boundary=boundary
+
+--boundary
+Content-disposition: inline
+Content-type: text/plain; charset=utf-8
+
+This should be added to the description.
+
+--boundary
+Content-disposition: inline
+Content-type: text/plain; charset=utf-8
+
+This should be added as a comment.
+
+--boundary
+Content-disposition: attachment; filename='attachment1'
+Content-type: text/plain; charset=utf-8
+
+This is an attachment.
+
+--boundary
+Content-disposition: inline
+Content-type: text/plain; charset=utf-8
+
+This should be added as another comment.
+
+--boundary
+Content-disposition: attachment; filename='attachment2'
+Content-description: Attachment description.
+Content-Transfer-Encoding: base64
+Content-Type: application/x-gzip
+
+H4sICH5op0UAA3Rlc3QudHh0AAvJyCxWAKLEvPySjNQihcSSksTkjNzUvBIdhfLMkgyFRIWU1OLk
+osyCksz8PD0uAPiw9nkwAAAA
+
+--boundary--
diff --git a/lib/lp/bugs/utilities/filebugdataparser.py b/lib/lp/bugs/utilities/filebugdataparser.py
index 4806dcb..fa8f5a1 100644
--- a/lib/lp/bugs/utilities/filebugdataparser.py
+++ b/lib/lp/bugs/utilities/filebugdataparser.py
@@ -32,6 +32,85 @@ class FileBugDataParser:
         self.attachments = []
         self.BUFFER_SIZE = 8192
 
+    def _findLineBreakType(self) -> bytes:
+        """Find the line-break/end_string the message is using and return it.
+
+        Assumptions:
+            - The method would be run at the start of any parsing run as this
+            method doesn't open the blob file but resets the blob-stream's
+            cursor.
+
+            - The line-break can only be LF ("\n") or CRLF ("\r\n") with
+            no CR ("\r") functionality.
+
+            - The message must have at least one line-break character when
+            parsing. An empty message or one without any line-breaks doesn't
+            count (which should be impossible anyway but I digress).
+
+            - The whole message must be made up of a single line-break type.
+            If more than one type is present, something will break after this
+            method.
+
+        Reads through the message until it finds an LF character ("\n") before
+        closing and reopening the file-alias/blob to reset its cursor (as the
+        file-alias/blob doesn't have a "seek" functionality). Then returns
+        whichever line-break type is used in the message.
+
+        ...
+        :raises AssertionError: If the method cannot find any LF linebreak on
+        the message stream, it raises.
+        ...
+        :return: Either byte typed CRLF (b"\r\n") or byte typed LF (b"\n")
+        :rtype: bytes
+        """
+
+        # The LF line break is assumed to be a part of the message as it is
+        # a part of both LF and CRLF.
+        lf_linebreak = b"\n"
+
+        # A temporary buffer we don't need to save outside the scope as it is
+        # only for finding the first line-break.
+        temp_buffer = b""
+
+        while lf_linebreak not in temp_buffer:
+            data = self.blob_file.read(self.BUFFER_SIZE)
+
+            # Append the data to the temp_buffer. This is to ensure any
+            # CR ("\r") that is part of a CRLF isn't deleted in a preceding
+            # buffer accidentally.
+            temp_buffer += data
+
+            if len(data) < self.BUFFER_SIZE:
+                # End of file.
+
+                if lf_linebreak not in temp_buffer:
+                    # If the linebreak isn't present, then the message must
+                    # be broken since the method must read from the start
+
+                    if (
+                        b"\r" in temp_buffer
+                    ):  # If the linebreak inside the whole message is only CR
+                        raise AssertionError(
+                            "The wrong linebreak is used. CR isn't accepted."
+                        )
+
+                    raise AssertionError(
+                        "There are no linebreaks in the blob."
+                    )
+                break
+
+        self.blob_file.close()
+        self.blob_file.open()
+
+        lf_index = temp_buffer.index(lf_linebreak)
+
+        # A slice is needed even if for a single character as bytes type acts
+        # differently in slices and in singe character accesses.
+        if temp_buffer[lf_index - 1 : lf_index] == b"\r":
+            return b"\r\n"
+
+        return b"\n"
+
     def _consumeBytes(self, end_string):
         """Read bytes from the message up to the end_string.
 
@@ -56,16 +135,16 @@ class FileBugDataParser:
         self._buffer = self._buffer[end_index + len(end_string) :]
         return bytes
 
-    def readHeaders(self):
+    def readHeaders(self, blob_linebreak=b"\n"):
         """Read the next set of headers of the message."""
-        header_text = self._consumeBytes(b"\n\n")
+        header_text = self._consumeBytes(blob_linebreak + blob_linebreak)
         # Use the email package to return a dict-like object of the
         # headers, so we don't have to parse the text ourselves.
         return email.message_from_bytes(header_text)
 
-    def readLine(self):
+    def readLine(self, blob_linebreak=b"\n"):
         """Read a line of the message."""
-        data = self._consumeBytes(b"\n")
+        data = self._consumeBytes(blob_linebreak)
         if data == b"":
             raise AssertionError("End of file reached.")
         return data
@@ -106,28 +185,31 @@ class FileBugDataParser:
         file on the file system. After using the returned data,
         removeTemporaryFiles() must be called.
         """
-        headers = self.readHeaders()
+        # Find the linebreak type the message uses
+        linebreak = self._findLineBreakType()
+
+        headers = self.readHeaders(linebreak)
         data = FileBugData()
         self._setDataFromHeaders(data, headers)
 
         # The headers is a Message instance.
         boundary = b"--" + six.ensure_binary(headers.get_param("boundary"))
-        line = self.readLine()
+        line = self.readLine(linebreak)
         while not line.startswith(boundary + b"--"):
             part_file = tempfile.TemporaryFile()
-            part_headers = self.readHeaders()
+            part_headers = self.readHeaders(linebreak)
             content_encoding = part_headers.get("Content-Transfer-Encoding")
             if content_encoding is not None and content_encoding != "base64":
                 raise AssertionError(
                     "Unknown encoding: %r." % content_encoding
                 )
-            line = self.readLine()
+            line = self.readLine(linebreak)
             while not line.startswith(boundary):
                 # Decode the file.
                 if content_encoding == "base64":
                     line = base64.b64decode(line)
                 part_file.write(line)
-                line = self.readLine()
+                line = self.readLine(linebreak)
             # Prepare the file for reading.
             part_file.seek(0)
             disposition = part_headers["Content-Disposition"]
diff --git a/lib/lp/bugs/utilities/tests/test_filebugdataparser.py b/lib/lp/bugs/utilities/tests/test_filebugdataparser.py
index 28d4be9..e35809b 100644
--- a/lib/lp/bugs/utilities/tests/test_filebugdataparser.py
+++ b/lib/lp/bugs/utilities/tests/test_filebugdataparser.py
@@ -18,16 +18,37 @@ from lp.bugs.utilities.filebugdataparser import FileBugDataParser
 from lp.testing import TestCase
 
 
+# ilkeremrekoc 27-02-2025:
+# This class is for mocking the open() method of librarian-file-alias that
+# doesn't exist in BytesIO. BytesIO is only used for testing purposes as the
+# production only uses LibrarianFileAlias. Thus, this additional functionality
+# won't affect the production.
+class MockBytesIO(io.BytesIO):
+
+    # Added a storage variable since the close() method used in
+    # FileBugDataParser flushes the BytesIO completely from memory making it
+    # impossible to re-open. Thus, "self.initial_bytes" let's us re-open the
+    # same input later.
+    def __init__(self, initial_bytes=b""):
+        self.initial_bytes = initial_bytes
+        super().__init__(initial_bytes)
+
+    # open() method doesn't exist in BytesIO originally, we add it to make
+    # sure tests act closer to a file/file-alias input into FileBugDataParser.
+    def open(self):
+        super().__init__(self.initial_bytes)
+
+
 class TestFileBugDataParser(TestCase):
     def test_initial_buffer(self):
         # The parser's buffer starts out empty.
-        parser = FileBugDataParser(io.BytesIO(b"123456789"))
+        parser = FileBugDataParser(MockBytesIO(b"123456789"))
         self.assertEqual(b"", parser._buffer)
 
     def test__consumeBytes(self):
         # _consumeBytes reads from the file until a delimiter is
         # encountered.
-        parser = FileBugDataParser(io.BytesIO(b"123456789"))
+        parser = FileBugDataParser(MockBytesIO(b"123456789"))
         parser.BUFFER_SIZE = 3
         self.assertEqual(b"1234", parser._consumeBytes(b"4"))
         # In order to find the delimiter string, it had to read b"123456"
@@ -44,9 +65,65 @@ class TestFileBugDataParser(TestCase):
         self.assertEqual(b"", parser._consumeBytes(b"0"))
         self.assertEqual(b"", parser._consumeBytes(b"0"))
 
+    def test__findLineBreakType(self):
+        # _findLineBreakType reads the whole message until either an LF or
+        # a CRLF is found, returns that type in bytes string format and exits.
+
+        msg = dedent(
+            """\
+            Header: value
+            Space-Folded-Header: this header
+             is folded with a space.
+            Tab-Folded-Header: this header
+            \tis folded with a tab.
+            Another-header: another-value
+
+            Not-A-Header: not-a-value
+            """
+        )
+
+        # Test the above message with LF endings
+        msg_with_lf = msg.encode("ASCII")
+        lf_parser = FileBugDataParser(MockBytesIO(msg_with_lf))
+        lf_linebreak = lf_parser._findLineBreakType()
+
+        self.assertEqual(lf_linebreak, b"\n")
+
+        # Test the above message after replacing LF with CRLF
+        msg_with_crlf = msg.replace("\n", "\r\n").encode("ASCII")
+        crlf_parser = FileBugDataParser(MockBytesIO(msg_with_crlf))
+        crlf_linebreak = crlf_parser._findLineBreakType()
+
+        self.assertEqual(crlf_linebreak, b"\r\n")
+
+        # Test the above message after deleting the line-breaks
+        # Which should break as there must be at least some line-breaks
+        # in the message
+        msg_without_linebreak = msg.replace("\n", "").encode("ASCII")
+        without_linebreak_parser = FileBugDataParser(
+            MockBytesIO(msg_without_linebreak)
+        )
+
+        self.assertRaisesWithContent(
+            AssertionError,
+            "There are no linebreaks in the blob.",
+            without_linebreak_parser._findLineBreakType,
+        )
+
+        # Test the above message after replacing LF with CR
+        # Which should break as we do not accept CR line-breaks
+        msg_with_cr = msg.replace("\n", "\r").encode("ASCII")
+        cr_parser = FileBugDataParser(MockBytesIO(msg_with_cr))
+
+        self.assertRaisesWithContent(
+            AssertionError,
+            "The wrong linebreak is used. CR isn't accepted.",
+            cr_parser._findLineBreakType,
+        )
+
     def test_readLine(self):
         # readLine reads a single line of the file.
-        parser = FileBugDataParser(io.BytesIO(b"123\n456\n789"))
+        parser = FileBugDataParser(MockBytesIO(b"123\n456\n789"))
         self.assertEqual(b"123\n", parser.readLine())
         self.assertEqual(b"456\n", parser.readLine())
         self.assertEqual(b"789", parser.readLine())
@@ -72,7 +149,7 @@ class TestFileBugDataParser(TestCase):
             Not-A-Header: not-a-value
             """
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(msg))
+        parser = FileBugDataParser(MockBytesIO(msg))
         headers = parser.readHeaders()
         self.assertEqual("value", headers["Header"])
         self.assertEqual(
@@ -147,7 +224,7 @@ class TestFileBugDataParser(TestCase):
             --boundary--
             """
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         data = parser.parse()
         self.assertEqual(
             "This should be added to the description.\n\nAnother line.",
@@ -175,7 +252,7 @@ class TestFileBugDataParser(TestCase):
             """
             % encoded_text
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         data = parser.parse()
         self.assertEqual(
             "This should be added to the description.\n\nAnother line.",
@@ -214,7 +291,7 @@ class TestFileBugDataParser(TestCase):
             --boundary--
             """
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         data = parser.parse()
         self.assertEqual(
             [
@@ -251,7 +328,7 @@ class TestFileBugDataParser(TestCase):
             --boundary--
             """
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         data = parser.parse()
         self.assertEqual(2, len(data.attachments))
         # The filename is copied into the "filename" item.
@@ -303,7 +380,7 @@ class TestFileBugDataParser(TestCase):
             """
             % encoded_data
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         data = parser.parse()
         self.assertEqual(1, len(data.attachments))
         self.assertEqual(
@@ -330,7 +407,7 @@ class TestFileBugDataParser(TestCase):
 
             Another line."""
         ).encode("ASCII")
-        parser = FileBugDataParser(io.BytesIO(message))
+        parser = FileBugDataParser(MockBytesIO(message))
         self.assertRaisesWithContent(
             AssertionError, "End of file reached.", parser.parse
         )

Follow ups

Re: [Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: İlker Emre Koç, 2025-02-26
Re: [Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: Yuliy Schwartzburg, 2025-02-26
Re: [Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: İlker Emre Koç, 2025-02-26
Re: [Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: Yuliy Schwartzburg, 2025-02-26
Re: [Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: Yuliy Schwartzburg, 2025-02-26
[Merge] ~ilkeremrekoc/launchpad:accept-crlf-bugreports into launchpad:master
From: İlker Emre Koç, 2025-02-26