← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~cjwatson/launchpad/buildmaster-cancel-properly into lp:launchpad

 

Colin Watson has proposed merging lp:~cjwatson/launchpad/buildmaster-cancel-properly into lp:launchpad with lp:~cjwatson/launchpad/buildstatus-aborted as a prerequisite.

Commit message:
Cancel builds by sending an "abort" command to the build slave, with a three-minute timeout after which we give up and rescue the builder.  This should now work even on non-virtualised builders.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)
Related bugs:
  Bug #54730 in launchpad-buildd: "launchpad buildd abort does not work as expected"
  https://bugs.launchpad.net/launchpad-buildd/+bug/54730

For more details, see:
https://code.launchpad.net/~cjwatson/launchpad/buildmaster-cancel-properly/+merge/177580

This is the second half of the master side of build cancellation, following on from the first half of the master side in https://code.launchpad.net/~cjwatson/launchpad/buildstatus-aborted/+merge/176990, and the slave side in https://code.launchpad.net/~cjwatson/launchpad-buildd/fix-abort/+merge/177003 (both those branches must be landed first).  With a working slave side we mainly just need to remove previous restrictions and arrange for a timeout in the event that the builder entirely fails to abort.

The bug that had previously led test_abort to be disabled should be fixed by using scan-for-processes on the slave rather than killing a specific process.

We should be able to get rid of most of BuilderStatus.ABORTED once this is all deployed and working.

For QA, we should test on dogfood with both virtualised and non-virtualised builders, using builds that (a) hang "normally" but are reasonably killable and (b) attempt to spawn processes faster than scan-for-processes can kill them.  It might also be a good idea to attempt some kind of fake slave that can be told to enter the ABORTED state without the master asking it to do so.
-- 
https://code.launchpad.net/~cjwatson/launchpad/buildmaster-cancel-properly/+merge/177580
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~cjwatson/launchpad/buildmaster-cancel-properly into lp:launchpad.
=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2013-07-30 12:42:28 +0000
+++ lib/lp/buildmaster/manager.py	2013-07-30 12:42:29 +0000
@@ -116,13 +116,25 @@
     # algorithm for polling.
     SCAN_INTERVAL = 15
 
-    def __init__(self, builder_name, logger):
+    # The time before deciding that a cancelling builder has failed, in
+    # seconds.  This should normally be a multiple of SCAN_INTERVAL, and
+    # greater than abort_timeout in launchpad-buildd's slave BuildManager.
+    CANCEL_TIMEOUT = 180
+
+    def __init__(self, builder_name, logger, clock=None):
         self.builder_name = builder_name
         self.logger = logger
+        # Use the clock if provided, so that tests can advance it.  Use the
+        # reactor by default.
+        if clock is None:
+            clock = reactor
+        self._clock = clock
+        self.date_cancel = None
 
     def startCycle(self):
         """Scan the builder and dispatch to it or deal with failures."""
         self.loop = LoopingCall(self.singleCycle)
+        self.loop.clock = self._clock
         self.stopping_deferred = self.loop.start(self.SCAN_INTERVAL)
         return self.stopping_deferred
 
@@ -184,26 +196,46 @@
         If the current build is in status CANCELLING then terminate it
         immediately.
 
-        :return: A deferred whose value is True if we cancelled the build.
+        :return: A deferred whose value is True if we recovered the builder
+            by resuming a slave host, so that there is no need to update its
+            status.
         """
-        if not builder.virtualized:
-            return defer.succeed(False)
         buildqueue = self.builder.getBuildQueue()
         if not buildqueue:
+            self.date_cancel = None
             return defer.succeed(False)
         build = buildqueue.specific_job.build
         if build.status != BuildStatus.CANCELLING:
+            self.date_cancel = None
             return defer.succeed(False)
 
         def resume_done(ignored):
             return defer.succeed(True)
 
+        if self.date_cancel is not None:
+            # The BuildFarmJob will normally set the build's status to
+            # something other than CANCELLING once the builder responds to
+            # the cancel request.  This timeout is in case it doesn't.
+            if self._clock.seconds() >= self.date_cancel:
+                self.logger.info("Build '%s' failed to cancel" % build.title)
+                self.date_cancel = None
+                buildqueue.cancel()
+                transaction.commit()
+                if builder.virtualized:
+                    d = builder.resumeSlaveHost()
+                    d.addCallback(resume_done)
+                    return d
+                else:
+                    builder.failBuilder("Build failed to cancel")
+            else:
+                self.logger.info(
+                    "Waiting for build '%s' to cancel" % build.title)
+            return defer.succeed(False)
+
         self.logger.info("Cancelling build '%s'" % build.title)
-        buildqueue.cancel()
-        transaction.commit()
-        d = builder.resumeSlaveHost()
-        d.addCallback(resume_done)
-        return d
+        builder.requestAbort()
+        self.date_cancel = self._clock.seconds() + self.CANCEL_TIMEOUT
+        return defer.succeed(False)
 
     def scan(self):
         """Probe the builder and update/dispatch/collect as appropriate.

=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py	2013-07-30 12:42:28 +0000
+++ lib/lp/buildmaster/model/builder.py	2013-07-30 12:42:29 +0000
@@ -335,16 +335,6 @@
         # communications with the slave and the build manager had to reset
         # the job.
         if status == 'BuilderStatus.ABORTED' and builder.currentjob is None:
-            if not builder.virtualized:
-                # We can't reset non-virtual builders reliably as the
-                # abort() function doesn't kill the actual build job,
-                # only the sbuild process!  All we can do here is fail
-                # the builder with a message indicating the problem and
-                # wait for an admin to reboot it.
-                builder.failBuilder(
-                    "Non-virtual builder in ABORTED state, requires admin to "
-                    "restart")
-                return "dummy status"
             if logger is not None:
                 logger.info(
                     "Builder '%s' being cleaned up from ABORTED" %

=== modified file 'lib/lp/buildmaster/tests/test_builder.py'
--- lib/lp/buildmaster/tests/test_builder.py	2013-07-30 12:42:28 +0000
+++ lib/lp/buildmaster/tests/test_builder.py	2013-07-30 12:42:29 +0000
@@ -310,21 +310,19 @@
         return d.addCallback(check_slave_calls)
 
     def test_recovery_of_aborted_nonvirtual_slave(self):
-        # Nonvirtual slaves in the ABORTED state cannot be reliably
-        # cleaned since the sbuild process doesn't properly kill the
-        # build job.  We test that the builder is marked failed.
+        # If a nonvirtual slave is in the ABORTED state,
+        # rescueBuilderIfLost should clean it if we don't think it's
+        # currently building anything.
         aborted_slave = AbortedSlave()
         builder = MockBuilder("mock_builder", aborted_slave)
         builder.currentjob = None
         builder.virtualized = False
-        builder.builderok = True
         d = builder.rescueIfLost()
 
-        def check_failed(ignored):
-            self.assertFalse(builder.builderok)
-            self.assertNotIn('clean', aborted_slave.call_log)
+        def check_slave_calls(ignored):
+            self.assertIn('clean', aborted_slave.call_log)
 
-        return d.addCallback(check_failed)
+        return d.addCallback(check_slave_calls)
 
     def test_recover_ok_slave(self):
         # An idle slave is not rescued.
@@ -895,10 +893,7 @@
         super(TestSlave, self).setUp()
         self.slave_helper = self.useFixture(SlaveTestHelpers())
 
-    # XXX 2010-10-06 Julian bug=655559
-    # This is failing on buildbot but not locally; it's trying to abort
-    # before the build has started.
-    def disabled_test_abort(self):
+    def test_abort(self):
         slave = self.slave_helper.getClientSlave()
         # We need to be in a BUILDING state before we can abort.
         d = self.slave_helper.triggerGoodBuild(slave)

=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
--- lib/lp/buildmaster/tests/test_manager.py	2013-07-30 12:42:28 +0000
+++ lib/lp/buildmaster/tests/test_manager.py	2013-07-30 12:42:29 +0000
@@ -111,14 +111,14 @@
         self.assertEqual(build.status, BuildStatus.BUILDING)
         self.assertEqual(job.logtail, logtail)
 
-    def _getScanner(self, builder_name=None):
+    def _getScanner(self, builder_name=None, clock=None):
         """Instantiate a SlaveScanner object.
 
         Replace its default logging handler by a testing version.
         """
         if builder_name is None:
             builder_name = BOB_THE_BUILDER_NAME
-        scanner = SlaveScanner(builder_name, BufferLogger())
+        scanner = SlaveScanner(builder_name, BufferLogger(), clock=clock)
         scanner.logger.name = 'slave-scanner'
 
         return scanner
@@ -394,19 +394,14 @@
 
         return d.addCallback(check)
 
+    @defer.inlineCallbacks
     def test_cancelling_a_build(self):
         # When scanning an in-progress build, if its state is CANCELLING
-        # then the build should be stopped and moved to the CANCELLED state.
-
-        # Set up a building slave with a fake resume method so we can see
-        # if it got called later.
-        slave = BuildingSlave(build_id="8-1")
-        call_counter = FakeMethod()
-
-        def fake_resume():
-            call_counter()
-            return defer.succeed((None, None, 0))
-        slave.resume = fake_resume
+        # then the build should be aborted, and eventually stopped and moved
+        # to the CANCELLED state if it does not abort by itself.
+
+        # Set up a mock building slave.
+        slave = BuildingSlave()
 
         # Set the sample data builder building with the slave from above.
         builder = getUtility(IBuilderSet)[BOB_THE_BUILDER_NAME]
@@ -419,6 +414,7 @@
         transaction.commit()
         login(ANONYMOUS)
         buildqueue = builder.currentjob
+        slave.build_id = buildqueue.specific_job.generateSlaveBuildCookie()
         self.assertBuildingJob(buildqueue, builder)
 
         # Now set the build to CANCELLING.
@@ -427,18 +423,28 @@
 
         # Run 'scan' and check its results.
         switch_dbuser(config.builddmaster.dbuser)
-        scanner = self._getScanner()
-        d = scanner.scan()
-
-        # The build state should be cancelled and we should have also
-        # called the resume() method on the slave that resets the virtual
-        # machine.
-        def check_cancelled(ignore, builder, buildqueue):
-            self.assertEqual(1, call_counter.call_count)
-            self.assertEqual(BuildStatus.CANCELLED, build.status)
-
-        d.addCallback(check_cancelled, builder, buildqueue)
-        return d
+        clock = task.Clock()
+        scanner = self._getScanner(clock=clock)
+        yield scanner.scan()
+
+        # An abort request should be sent.
+        self.assertEqual(1, slave.call_log.count("abort"))
+        self.assertEqual(BuildStatus.CANCELLING, build.status)
+
+        # Advance time a little.  Nothing much should happen.
+        clock.advance(1)
+        yield scanner.scan()
+        self.assertEqual(1, slave.call_log.count("abort"))
+        self.assertEqual(BuildStatus.CANCELLING, build.status)
+
+        # Advance past the timeout.  The build state should be cancelled and
+        # we should have also called the resume() method on the slave that
+        # resets the virtual machine.
+        clock.advance(SlaveScanner.CANCEL_TIMEOUT)
+        yield scanner.scan()
+        self.assertEqual(1, slave.call_log.count("abort"))
+        self.assertEqual(1, slave.call_log.count("resume"))
+        self.assertEqual(BuildStatus.CANCELLED, build.status)
 
 
 class TestCancellationChecking(TestCaseWithFactory):
@@ -452,14 +458,17 @@
         builder_name = BOB_THE_BUILDER_NAME
         self.builder = getUtility(IBuilderSet)[builder_name]
         self.builder.virtualized = True
-        self.scanner = SlaveScanner(builder_name, BufferLogger())
-        self.scanner.builder = self.builder
-        self.scanner.logger.name = 'slave-scanner'
+
+    def _getScanner(self, clock=None):
+        scanner = SlaveScanner(None, BufferLogger(), clock=clock)
+        scanner.builder = self.builder
+        scanner.logger.name = 'slave-scanner'
+        return scanner
 
     def test_ignores_nonvirtual(self):
         # If the builder is nonvirtual make sure we return False.
         self.builder.virtualized = False
-        d = self.scanner.checkCancellation(self.builder)
+        d = self._getScanner().checkCancellation(self.builder)
         return d.addCallback(self.assertFalse)
 
     def test_ignores_no_buildqueue(self):
@@ -467,37 +476,37 @@
         # make sure we return False.
         buildqueue = self.builder.currentjob
         buildqueue.reset()
-        d = self.scanner.checkCancellation(self.builder)
+        d = self._getScanner().checkCancellation(self.builder)
         return d.addCallback(self.assertFalse)
 
     def test_ignores_build_not_cancelling(self):
         # If the active build is not in a CANCELLING state, ignore it.
-        d = self.scanner.checkCancellation(self.builder)
+        d = self._getScanner().checkCancellation(self.builder)
         return d.addCallback(self.assertFalse)
 
+    @defer.inlineCallbacks
     def test_cancelling_build_is_cancelled(self):
-        # If a build is CANCELLING, make sure True is returned and the
-        # slave was resumed.
-        call_counter = FakeMethod()
-
-        def fake_resume():
-            call_counter()
-            return defer.succeed((None, None, 0))
+        # If a build is CANCELLING and the cancel timeout expires, make sure
+        # True is returned and the slave was resumed.
         slave = OkSlave()
-        slave.resume = fake_resume
         self.builder.vm_host = "fake_vm_host"
         self.patch(BuilderSlave, 'makeBuilderSlave', FakeMethod(slave))
         buildqueue = self.builder.currentjob
         build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(buildqueue)
         build.updateStatus(BuildStatus.CANCELLING)
-
-        def check(result):
-            self.assertEqual(1, call_counter.call_count)
-            self.assertTrue(result)
-            self.assertEqual(BuildStatus.CANCELLED, build.status)
-
-        d = self.scanner.checkCancellation(self.builder)
-        return d.addCallback(check)
+        clock = task.Clock()
+        scanner = self._getScanner(clock=clock)
+
+        result = yield scanner.checkCancellation(self.builder)
+        self.assertNotIn("resume", slave.call_log)
+        self.assertFalse(result)
+        self.assertEqual(BuildStatus.CANCELLING, build.status)
+
+        clock.advance(SlaveScanner.CANCEL_TIMEOUT)
+        result = yield scanner.checkCancellation(self.builder)
+        self.assertEqual(1, slave.call_log.count("resume"))
+        self.assertTrue(result)
+        self.assertEqual(BuildStatus.CANCELLED, build.status)
 
 
 class TestBuilddManager(TestCase):

=== modified file 'lib/lp/soyuz/model/binarypackagebuild.py'
--- lib/lp/soyuz/model/binarypackagebuild.py	2013-06-20 05:50:00 +0000
+++ lib/lp/soyuz/model/binarypackagebuild.py	2013-07-30 12:42:29 +0000
@@ -382,8 +382,6 @@
         """See `IBuild`."""
         if not self.buildqueue_record:
             return False
-        if self.buildqueue_record.virtualized is False:
-            return False
 
         cancellable_statuses = [
             BuildStatus.BUILDING,


Follow ups