launchpad-reviewers team mailing list archive

[Merge] lp:~julian-edwards/launchpad/log-parser-bug-680463 into lp:launchpad

 

Julian Edwards has proposed merging lp:~julian-edwards/launchpad/log-parser-bug-680463 into lp:launchpad.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  #680463 Apache log parser crashes out on large gzip files
  https://bugs.launchpad.net/bugs/680463


Figure out the length of gzipped log files without having to read them into memory.

The existing code reads the entire uncompressed contents of a gzip file into memory.  This makes the PPA log parser blow up quite horribly, because the log files are very large.
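
A gzip file stores the uncompressed length, modulo 2**32, in its last four bytes (the ISIZE trailer field), so the size can be read straight from the end of the file instead of decompressing everything.  A minimal standalone sketch of that idea, for illustration only (uncompressed_size is a hypothetical helper, not code from this branch):

    import os
    import struct

    def uncompressed_size(path):
        # ISIZE occupies the last 4 bytes of a gzip file: the uncompressed
        # length as a little-endian unsigned 32-bit integer, i.e. the true
        # size modulo 2**32 (files over 4 GiB wrap around).
        with open(path, 'rb') as f:
            f.seek(-4, os.SEEK_END)
            (isize,) = struct.unpack('<I', f.read(4))
        return isize

The branch does the same thing via gzip.read32() on the GzipFile's underlying fileobj; the & 0xffffffffL mask there serves the same purpose as the unsigned unpack above.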

Use existing test with:
bin/test -cvv test_apachelogparser Test_get_fd_and_file_size


QA Plan
-------

I have a copy of the production log files that cause the crash on dogfood.  Running with the fix allows processing to continue with no increase in memory usage, as observed in "top".
-- 
https://code.launchpad.net/~julian-edwards/launchpad/log-parser-bug-680463/+merge/41865
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~julian-edwards/launchpad/log-parser-bug-680463 into lp:launchpad.
=== modified file 'lib/lp/services/apachelogparser/base.py'
--- lib/lp/services/apachelogparser/base.py	2010-11-17 23:20:07 +0000
+++ lib/lp/services/apachelogparser/base.py	2010-11-25 14:21:23 +0000
@@ -64,13 +64,14 @@
     file_path points to a gzipped file.
     """
     if file_path.endswith('.gz'):
+        # The last 4 bytes of the file contain the uncompressed file's
+        # size, modulo 2**32.  This code is somewhat stolen from the gzip
+        # module in Python 2.6.
         fd = gzip.open(file_path)
-        # There doesn't seem to be a better way of figuring out the
-        # uncompressed size of a file, so we'll read the whole file here.
-        file_size = len(fd.read())
-        # Seek back to the beginning of the file as if we had just opened
-        # it.
-        fd.seek(0)
+        fd.fileobj.seek(-4, os.SEEK_END)
+        isize = gzip.read32(fd.fileobj)   # may exceed 2GB
+        file_size = isize & 0xffffffffL
+        fd.fileobj.seek(0)
     else:
         fd = open(file_path)
         file_size = os.path.getsize(file_path)