← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~julian-edwards/launchpad/lost-builder-bug-463046 into lp:launchpad/devel

 

Julian Edwards has proposed merging lp:~julian-edwards/launchpad/lost-builder-bug-463046 into lp:launchpad/devel.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  #463046 Rescuing a BUILDING builder just makes things worse
  https://bugs.launchpad.net/bugs/463046


= Summary =
Reset aborted builders that don't have any buildqueue entry

== Proposed fix ==
See bug 436046.

Basically, when builders end up building something that we don't know about, 
they need to be aborted and cleaned.  Right now they sit in the "aborted" 
state forever because the current handler for that relies on there being a 
buildqueue row for the build job.  In the case of comms failures etc., that 
may not be the case.

== Pre-implementation notes ==
Talked to wgrant.

== Implementation details ==
There's an existing function that checks for the slave state and does things 
based on what it returns to us.  I've changed it so that if it sees the 
ABORTED state and builder.currentjob is None, then we need to clean it - 
slave.clean() just resets it back to IDLE, ready for another job).

== Tests ==
bin/test -cvvt Test_rescueBuilderIfLost

== Demo and Q/A ==
The QA plan on dogfood is as follows:

 * Initiate a build
 * Stop the buildd-manager to give us time to fiddle about.
 * Rip all traces of the build out of dogfood's database by deleting the 
buildqueue and job.
 * Start the buildd-manager - when it scans the builder with the orphaned 
build it should now reset it successfully.

= Launchpad lint =

(this lint is all crazy nonsense, we need to fix our linter!)

Checking for conflicts and issues in changed files.

Linting changed files:
  lib/lp/buildmaster/model/builder.py
  lib/lp/buildmaster/tests/test_builder.py

./lib/lp/buildmaster/model/builder.py
     191: E302 expected 2 blank lines, found 1
     201: E202 whitespace before '}'
     421: E231 missing whitespace after ','
     432: E231 missing whitespace after ','
     557: E301 expected 1 blank line, found 0
     609: E231 missing whitespace after ','
     730: E231 missing whitespace after ','
./lib/lp/buildmaster/tests/test_builder.py
     354: E301 expected 1 blank line, found 2
-- 
https://code.launchpad.net/~julian-edwards/launchpad/lost-builder-bug-463046/+merge/33369
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~julian-edwards/launchpad/lost-builder-bug-463046 into lp:launchpad/devel.
=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py	2010-08-20 20:31:18 +0000
+++ lib/lp/buildmaster/model/builder.py	2010-08-23 11:01:11 +0000
@@ -206,6 +206,20 @@
     # IBuilder.slaveStatusSentence().
     status = status_sentence[0]
 
+    # If the cookie test below fails, it will request an abort of the
+    # builder.  This will leave the builder in the aborted state and
+    # with no assigned job, and we should now "clean" the slave which
+    # will reset its state back to IDLE, ready to accept new builds.
+    # This situation is usually caused by a temporary loss of
+    # communications with the slave and the build manager had to reset
+    # the job.
+    if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
+        builder.cleanSlave()
+        if logger is not None:
+            logger.info(
+                "Builder '%s' cleaned up from ABORTED" % builder.name)
+        return
+
     # If slave is not building nor waiting, it's not in need of rescuing.
     if status not in ident_position.keys():
         return
@@ -220,7 +234,7 @@
         else:
             builder.requestAbort()
         if logger:
-            logger.warn(
+            logger.info(
                 "Builder '%s' rescued from '%s': '%s'" %
                 (builder.name, slave_build_id, reason))
 

=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
--- lib/lp/buildmaster/tests/test_builder.py	2010-08-20 20:31:18 +0000
+++ lib/lp/buildmaster/tests/test_builder.py	2010-08-23 11:01:11 +0000
@@ -29,6 +29,10 @@
 from lp.soyuz.model.binarypackagebuildbehavior import (
     BinaryPackageBuildBehavior,
     )
+from lp.soyuz.tests.soyuzbuilddhelpers import (
+    AbortedSlave,
+    MockBuilder,
+    )
 from lp.soyuz.tests.test_publishing import SoyuzTestPublisher
 from lp.testing import TestCaseWithFactory
 from lp.testing.fakemethod import FakeMethod
@@ -90,6 +94,26 @@
         self.assertEqual(0, builder.handleTimeout.call_count)
 
 
+class Test_rescueBuilderIfLost(TestCaseWithFactory):
+    """Tests for lp.buildmaster.model.builder.rescueBuilderIfLost."""
+
+    layer = LaunchpadZopelessLayer
+
+    def test_recovery_of_aborted_slave(self):
+        # If a slave is in the ABORTED state, rescueBuilderIfLost should
+        # clean it if we don't think it's currently building anything.
+        # See bug 463046.
+        aborted_slave = AbortedSlave()
+        # The slave's clean() method is normally an XMLRPC call, so we
+        # can just stub it out and check that it got called.
+        aborted_slave.clean = FakeMethod()
+        builder = MockBuilder("mock_builder", aborted_slave)
+        builder.currentjob = None
+        builder.rescueIfLost()
+
+        self.assertEqual(1, aborted_slave.clean.call_count)
+
+
 class TestFindBuildCandidateBase(TestCaseWithFactory):
     """Setup the test publisher and some builders."""