[Merge] lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad

 

William Grant has proposed merging lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad.

Commit message:
~wgrant/launchpad/checkCancellation-nicify

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~wgrant/launchpad/resetOrFail-pointless/+merge/224259

This branch further simplifies buildd-manager error handling. The only significant change is that virtual builders will now fail after four resets instead of 14.

In the old days (like, say, a week ago), assessFailureCounts would retry a scan five times, then reset a virtual builder or fail a non-virtual one. In the virtual case, it would reset after five and ten failures, and only fail after 15. But since SlaveScanner.scan() now respects Builder.clean_status by resetting DIRTY virtual builders at the start of a scan, assessFailureCounts actually ends up implicitly resetting the builder four times, then explicitly once, then implicitly another four times, then explicitly once, then implicitly another four, finally failing the builder on the 15th scan.
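
Spelled out, that old timeline for a virtual builder was:

    failures  1-4:  job reset; builder left DIRTY, so the next scan resets it (4 implicit resets)
    failure   5:    explicit reset
    failures  6-9:  4 more implicit resets
    failure  10:    explicit reset
    failures 11-14: 4 more implicit resets
    failure  15:    builder failed

That's 14 resets before we give up.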

Resetting 14 times is probably slightly excessive, so I've simplified the logic to just retry the scan four times and then fail, regardless of virtness. In the virt case this will cause four resets before failure, while the non-virt case will, as always, just have its four chances to magically come good.
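
For reference, here's a condensed sketch of the new decision table (matching the judge_failure changes in the diff below; the equal-counts guard isn't visible in the hunk, so it's inferred from the tests). The verdict is a (builder_action, job_action) pair: True means reset, False means fail, None means leave it alone.

    JOB_RESET_THRESHOLD = 3
    BUILDER_FAILURE_THRESHOLD = 5

    def judge_failure_sketch(builder_count, job_count, retry=True):
        if builder_count == job_count:
            # Can't tell who's at fault; requeue the job once we've
            # seen enough failures (or if we may not retry at all).
            if not retry or builder_count >= JOB_RESET_THRESHOLD:
                return (None, True)
        elif builder_count > job_count:
            if builder_count < BUILDER_FAILURE_THRESHOLD:
                # Dirty the builder and give it a few cycles to
                # recover; a dirty, idle virt gets reset next scan.
                return (True, True)
            # Four resets weren't enough, so fail the builder.
            return (False, True)
        else:
            # The job has failed more than the builder. Fail it.
            return (None, False)
        return (None, None)

The tests pin exactly these verdicts: equal counts below the threshold give (None, None), a bad builder below BUILDER_FAILURE_THRESHOLD gives (True, True), and at the threshold (False, True).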

This actually simplifies things considerably. assessFailureCounts and _scanFailed no longer return Deferreds, and resetOrFail dies entirely. I also renamed assessFailureCounts to recover_failure, to match our non-member function naming style and to be more accurate.
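
The scanner's failure path correspondingly loses its Deferred plumbing; the before/after call shape, lifted from the _scanFailed hunk below:

    # Before: Deferred-returning, with slave and interactor plumbing.
    yield assessFailureCounts(
        self.logger, vitals, builder, self.slave_factory(vitals),
        self.interactor_factory(), retry, failure.value)

    # After: a plain synchronous call.
    recover_failure(self.logger, vitals, builder, retry, failure.value)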
-- 
https://code.launchpad.net/~wgrant/launchpad/resetOrFail-pointless/+merge/224259
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad.
=== modified file 'lib/lp/buildmaster/interactor.py'
--- lib/lp/buildmaster/interactor.py	2014-06-23 14:52:15 +0000
+++ lib/lp/buildmaster/interactor.py	2014-06-24 10:22:57 +0000
@@ -380,45 +380,6 @@
         yield behaviour.dispatchBuildToSlave(build_queue_item.id, logger)
 
     @classmethod
-    def resetOrFail(cls, vitals, slave, builder, logger, exception):
-        """Handle "confirmed" build slave failures.
-
-        Call this when there have been multiple failures that are not just
-        the fault of failing jobs, or when the builder has entered an
-        ABORTED state without having been asked to do so.
-
-        In case of a virtualized builder we ensure it's dirty so the
-        next scan will reset it.
-
-        A non-virtualized builder will be failed straightaway (using
-        `failBuilder`).
-
-        :param logger: The logger object to be used for logging.
-        :param exception: An exception to be used for logging.
-        :return: True if the builder is to be resumed, None otherwise.
-        """
-        error_message = str(exception)
-        if vitals.virtualized:
-            # Virtualized/PPA builder: mark it dirty so the next
-            # scan will attempt a reset.
-            logger.warn(
-                "Dirtying builder: %s -- %s" % (vitals.url, error_message),
-                exc_info=True)
-            if builder.clean_status != BuilderCleanStatus.DIRTY:
-                builder.setCleanStatus(BuilderCleanStatus.DIRTY)
-                transaction.commit()
-            return True
-        else:
-            # XXX: This should really let the failure bubble up to the
-            # scan() method that does the failure counting.
-            # Mark builder as 'failed'.
-            logger.warn(
-                "Disabling builder: %s -- %s" % (vitals.url, error_message))
-            builder.failBuilder(error_message)
-            transaction.commit()
-            return False
-
-    @classmethod
     @defer.inlineCallbacks
     def findAndStartJob(cls, vitals, builder, slave):
         """Find a job to run and send it to the buildd slave.

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2014-06-23 14:52:15 +0000
+++ lib/lp/buildmaster/manager.py	2014-06-24 10:22:57 +0000
@@ -52,6 +52,15 @@
 BUILDD_MANAGER_LOG_NAME = "slave-scanner"
 
 
+# The number of times a builder can consecutively fail before we
+# reset its current job.
+JOB_RESET_THRESHOLD = 3
+
+# The number of times a builder can consecutively fail before we
+# mark it builderok=False.
+BUILDER_FAILURE_THRESHOLD = 5
+
+
 class BuilderFactory:
     """A dumb builder factory that just talks to the DB."""
 
@@ -156,22 +165,19 @@
         # fault, it'll error on the next builder and fail out. If the
         # builder is at fault, the job will work fine next time, and the
         # builder will error on the next job and fail out.
-        if not retry or builder_count >= Builder.JOB_RESET_THRESHOLD:
+        if not retry or builder_count >= JOB_RESET_THRESHOLD:
             return (None, True)
     elif builder_count > job_count:
         # The builder has failed more than the job, so the builder is at
-        # fault. We reset the job, and attempt to recover the builder.
-        # We retry a few times, then reset the builder. Then try a few
-        # more times, then reset it again. Then try yet again, after
-        # which we fail.
-        # XXX: Rescanning a virtual builder without resetting is stupid,
-        # as it will be reset anyway due to being dirty and idle.
-        if (builder_count >=
-                Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
+        # fault. We reset the job and attempt to recover the builder.
+        if builder_count < BUILDER_FAILURE_THRESHOLD:
+            # Let's dirty the builder and give it a few cycles to
+            # recover. Since it's dirty and idle, this will
+            # automatically attempt a reset if virtual.
+            return (True, True)
+        else:
+            # We've retried too many times, so fail the builder.
             return (False, True)
-        elif builder_count % Builder.RESET_THRESHOLD == 0:
-            return (True, True)
-        return (None, True)
     else:
         # The job has failed more than the builder. Fail it.
         return (None, False)
@@ -180,24 +186,32 @@
     return (None, None)
 
 
-@defer.inlineCallbacks
-def assessFailureCounts(logger, vitals, builder, slave, interactor, retry,
-                        exception):
-    """View builder/job failure_count and work out which needs to die.
-
-    :return: A Deferred that fires either immediately or after a virtual
-        slave has been reset.
-    """
+def recover_failure(logger, vitals, builder, retry, exception):
+    """Recover from a scan failure by slapping the builder or job."""
     del get_property_cache(builder).currentjob
     job = builder.currentjob
+
+    # If a job is being cancelled we won't bother retrying a failure.
+    # Just mark it as cancelled and clear the builder for normal cleanup.
+    cancelling = job is not None and job.status == BuildQueueStatus.CANCELLING
+
+    # judge_failure decides who is guilty and their sentences. We're
+    # just the executioner.
     builder_action, job_action = judge_failure(
         builder.failure_count, job.specific_build.failure_count if job else 0,
-        exception, retry=retry)
+        exception, retry=retry and not cancelling)
 
-    if job is not None:
+    if job is not None and job_action is not None:
         if job_action == False:
+            # We've decided the job is bad, so unblame the builder.
+            builder.resetFailureCount()
+
+        if cancelling:
+            # We've previously been asked to cancel the job, so just set
+            # it to cancelled rather than retrying or failing.
+            job.markAsCancelled()
+        elif job_action == False:
             # Fail and dequeue the job.
-            builder.resetFailureCount()
             job.specific_build.updateStatus(BuildStatus.FAILEDTOBUILD)
             job.destroySelf()
         elif job_action == True:
@@ -209,11 +223,10 @@
         # little choice but to give up.
         builder.failBuilder(str(exception))
     elif builder_action == True:
-        # The builder is dead, but in the virtual case it might be worth
-        # resetting it.
-        yield interactor.resetOrFail(
-            vitals, slave, builder, logger, exception)
-    del get_property_cache(builder).currentjob
+        # Dirty the builder to attempt recovery. In the virtual case,
+        # the dirty idleness will cause a reset, giving us a good chance
+        # of recovery.
+        builder.setCleanStatus(BuilderCleanStatus.DIRTY)
 
 
 class SlaveScanner:
@@ -293,16 +306,12 @@
     def _updateDateScanned(self, ignored):
         self.date_scanned = datetime.datetime.utcnow()
 
-    @defer.inlineCallbacks
     def _scanFailed(self, retry, failure):
         """Deal with failures encountered during the scan cycle.
 
         1. Print the error in the log
         2. Increment and assess failure counts on the builder and job.
            If asked to retry, a single failure may not be considered fatal.
-
-        :return: A Deferred that fires either immediately or after a virtual
-            slave has been reset.
         """
         # Make sure that pending database updates are removed as it
         # could leave the database in an inconsistent state (e.g. The
@@ -327,9 +336,7 @@
         builder = self.builder_factory[self.builder_name]
         try:
             builder.handleFailure(self.logger)
-            yield assessFailureCounts(
-                self.logger, vitals, builder, self.slave_factory(vitals),
-                self.interactor_factory(), retry, failure.value)
+            recover_failure(self.logger, vitals, builder, retry, failure.value)
             transaction.commit()
         except Exception:
             # Catastrophic code failure! Not much we can do.
@@ -339,60 +346,38 @@
             transaction.abort()
 
     @defer.inlineCallbacks
-    def checkCancellation(self, vitals, slave, interactor):
+    def checkCancellation(self, vitals, slave):
         """See if there is a pending cancellation request.
 
         If the current build is in status CANCELLING then terminate it
         immediately.
 
-        :return: A deferred whose value is True if we recovered the builder
-            by resuming a slave host, so that there is no need to update its
-            status.
+        :return: A deferred which fires when this cancellation cycle is done.
         """
-        if vitals.build_queue is None:
-            self.date_cancel = None
-            defer.returnValue(False)
         if vitals.build_queue.status != BuildQueueStatus.CANCELLING:
             self.date_cancel = None
-            defer.returnValue(False)
-
-        try:
-            if self.date_cancel is None:
-                self.logger.info(
-                    "Cancelling BuildQueue %d (%s) on %s",
-                    vitals.build_queue.id, self.getExpectedCookie(vitals),
-                    vitals.name)
-                yield slave.abort()
-                self.date_cancel = self._clock.seconds() + self.CANCEL_TIMEOUT
-                defer.returnValue(False)
-            else:
-                # The BuildFarmJob will normally set the build's status to
-                # something other than CANCELLING once the builder responds to
-                # the cancel request.  This timeout is in case it doesn't.
-                if self._clock.seconds() < self.date_cancel:
-                    self.logger.info(
-                        "Waiting for BuildQueue %d (%s) on %s to cancel",
-                        vitals.build_queue.id, self.getExpectedCookie(vitals),
-                        vitals.name)
-                    defer.returnValue(False)
-                else:
-                    raise BuildSlaveFailure(
-                        "Timeout waiting for BuildQueue %d (%s) on %s to "
-                        "cancel" % (
-                        vitals.build_queue.id, self.getExpectedCookie(vitals),
-                        vitals.name))
-        except Exception as e:
+        elif self.date_cancel is None:
             self.logger.info(
-                "Failure while cancelling BuildQueue %d (%s) on %s",
+                "Cancelling BuildQueue %d (%s) on %s",
                 vitals.build_queue.id, self.getExpectedCookie(vitals),
                 vitals.name)
-            self.date_cancel = None
-            vitals.build_queue.markAsCancelled()
-            transaction.commit()
-            resumed = yield interactor.resetOrFail(
-                vitals, slave, self.builder_factory[vitals.name], self.logger,
-                e)
-            defer.returnValue(resumed)
+            yield slave.abort()
+            self.date_cancel = self._clock.seconds() + self.CANCEL_TIMEOUT
+        else:
+            # The BuildFarmJob will normally set the build's status to
+            # something other than CANCELLING once the builder responds to
+            # the cancel request.  This timeout is in case it doesn't.
+            if self._clock.seconds() < self.date_cancel:
+                self.logger.info(
+                    "Waiting for BuildQueue %d (%s) on %s to cancel",
+                    vitals.build_queue.id, self.getExpectedCookie(vitals),
+                    vitals.name)
+            else:
+                raise BuildSlaveFailure(
+                    "Timeout waiting for BuildQueue %d (%s) on %s to "
+                    "cancel" % (
+                    vitals.build_queue.id, self.getExpectedCookie(vitals),
+                    vitals.name))
 
     def getExpectedCookie(self, vitals):
         """Return the build cookie expected to be held by the slave.
@@ -461,10 +446,7 @@
                 transaction.commit()
                 return
 
-            cancelled = yield self.checkCancellation(
-                vitals, slave, interactor)
-            if cancelled:
-                return
+            yield self.checkCancellation(vitals, slave)
 
             # The slave and DB agree on the builder's state.  Scan the
             # slave and get the logtail, or collect the build if it's

=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py	2014-06-20 11:52:32 +0000
+++ lib/lp/buildmaster/model/builder.py	2014-06-24 10:22:57 +0000
@@ -107,19 +107,6 @@
     vm_reset_protocol = EnumCol(enum=BuilderResetProtocol)
     date_clean_status_changed = UtcDateTimeCol()
 
-    # The number of times a builder can consecutively fail before we
-    # reset its current job.
-    JOB_RESET_THRESHOLD = 3
-
-    # The number of times a builder can consecutively fail before we try
-    # resetting it (if virtual) or marking it builderok=False (if not).
-    RESET_THRESHOLD = 5
-
-    # The number of times a virtual builder can reach its reset threshold
-    # due to consecutive failures before we give up and mark it
-    # builderok=False.
-    RESET_FAILURE_THRESHOLD = 3
-
     def _getBuilderok(self):
         return self._builderok
 

=== modified file 'lib/lp/buildmaster/tests/test_interactor.py'
--- lib/lp/buildmaster/tests/test_interactor.py	2014-06-23 14:52:15 +0000
+++ lib/lp/buildmaster/tests/test_interactor.py	2014-06-24 10:22:57 +0000
@@ -161,14 +161,6 @@
         d = self.resumeSlaveHost(MockBuilder(virtualized=True, vm_host="pop"))
         return assert_fails_with(d, CannotResumeHost)
 
-    def test_resetOrFail_nonvirtual(self):
-        builder = MockBuilder(virtualized=False, builderok=True)
-        vitals = extract_vitals_from_db(builder)
-        self.assertFalse(
-            BuilderInteractor().resetOrFail(
-                vitals, None, builder, DevNullLogger(), Exception()))
-        self.assertFalse(builder.builderok)
-
     def test_makeSlaveFromVitals(self):
         # Builder.slave is a BuilderSlave that points at the actual Builder.
         # The Builder is only ever used in scripts that run outside of the

=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
--- lib/lp/buildmaster/tests/test_manager.py	2014-06-23 15:14:46 +0000
+++ lib/lp/buildmaster/tests/test_manager.py	2014-06-24 10:22:57 +0000
@@ -37,6 +37,7 @@
     )
 from lp.buildmaster.interfaces.builder import (
     BuildDaemonIsolationError,
+    BuildSlaveFailure,
     IBuilderSet,
     )
 from lp.buildmaster.interfaces.buildfarmjobbehaviour import (
@@ -44,15 +45,16 @@
     )
 from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
 from lp.buildmaster.manager import (
-    assessFailureCounts,
     BuilddManager,
+    BUILDER_FAILURE_THRESHOLD,
     BuilderFactory,
+    JOB_RESET_THRESHOLD,
     judge_failure,
     NewBuildersScanner,
     PrefetchedBuilderFactory,
+    recover_failure,
     SlaveScanner,
     )
-from lp.buildmaster.model.builder import Builder
 from lp.buildmaster.tests.harness import BuilddManagerTestSetup
 from lp.buildmaster.tests.mock_slaves import (
     BrokenSlave,
@@ -375,7 +377,7 @@
         scanner = self._getScanner()
         scanner.scan = failing_scan
         from lp.buildmaster import manager as manager_module
-        self.patch(manager_module, 'assessFailureCounts', FakeMethod())
+        self.patch(manager_module, 'recover_failure', FakeMethod())
         builder = getUtility(IBuilderSet)[scanner.builder_name]
 
         builder.failure_count = builder_count
@@ -396,7 +398,7 @@
         self.assertEqual(
             expected_job_count,
             builder.current_build.failure_count)
-        self.assertEqual(1, manager_module.assessFailureCounts.call_count)
+        self.assertEqual(1, manager_module.recover_failure.call_count)
 
     def test_scan_first_fail(self):
         # The first failure of a job should result in the failure_count
@@ -426,8 +428,7 @@
         scanner = self._getScanner()
         scanner.scan = failing_scan
         builder = getUtility(IBuilderSet)[scanner.builder_name]
-        builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
+        builder.failure_count = BUILDER_FAILURE_THRESHOLD
         builder.currentjob.reset()
         transaction.commit()
 
@@ -557,7 +558,7 @@
         # we should have also called the resume() method on the slave that
         # resets the virtual machine.
         clock.advance(SlaveScanner.CANCEL_TIMEOUT)
-        yield scanner.scan()
+        yield scanner.singleCycle()
         self.assertEqual(1, slave.call_log.count("abort"))
         self.assertEqual(BuilderCleanStatus.DIRTY, builder.clean_status)
         self.assertEqual(BuildStatus.CANCELLED, build.status)
@@ -892,8 +893,8 @@
         self.assertEqual(
             (None, None),
             judge_failure(
-                Builder.JOB_RESET_THRESHOLD - 1,
-                Builder.JOB_RESET_THRESHOLD - 1, Exception()))
+                JOB_RESET_THRESHOLD - 1, JOB_RESET_THRESHOLD - 1,
+                Exception()))
 
     def test_same_count_exceeding_threshold(self):
         # Several consecutive failures suggest that something might be
@@ -901,42 +902,30 @@
         self.assertEqual(
             (None, True),
             judge_failure(
-                Builder.JOB_RESET_THRESHOLD, Builder.JOB_RESET_THRESHOLD,
-                Exception()))
+                JOB_RESET_THRESHOLD, JOB_RESET_THRESHOLD, Exception()))
 
     def test_same_count_no_retries(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(
-                Builder.JOB_RESET_THRESHOLD - 1,
-                Builder.JOB_RESET_THRESHOLD - 1, Exception(), retry=False))
-
-    def test_bad_builder_below_threshold(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(Builder.RESET_THRESHOLD - 1, 1, Exception()))
-
-    def test_bad_builder_at_reset_threshold(self):
-        self.assertEqual(
-            (True, True),
-            judge_failure(Builder.RESET_THRESHOLD, 1, Exception()))
-
-    def test_bad_builder_above_reset_threshold(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(
-                Builder.RESET_THRESHOLD + 1, Builder.RESET_THRESHOLD,
-                Exception()))
-
-    def test_bad_builder_second_reset(self):
-        self.assertEqual(
-            (True, True),
-            judge_failure(Builder.RESET_THRESHOLD * 2, 1, Exception()))
+        # A single failure of both causes a job reset if retries are
+        # forbidden.
+        self.assertEqual(
+            (None, True),
+            judge_failure(
+                JOB_RESET_THRESHOLD - 1, JOB_RESET_THRESHOLD - 1, Exception(),
+                retry=False))
+
+    def test_bad_builder(self):
+        # A bad builder resets its job and dirties itself. The next scan
+        # will do what it can to recover it (resetting in the virtual
+        # case, or just retrying for non-virts).
+        self.assertEqual(
+            (True, True),
+            judge_failure(BUILDER_FAILURE_THRESHOLD - 1, 1, Exception()))
 
     def test_bad_builder_gives_up(self):
+        # A persistently bad builder resets its job and fails itself.
         self.assertEqual(
             (False, True),
-            judge_failure(Builder.RESET_THRESHOLD * 3, 1, Exception()))
+            judge_failure(BUILDER_FAILURE_THRESHOLD, 1, Exception()))
 
     def test_bad_job_fails(self):
         self.assertEqual(
@@ -960,7 +949,6 @@
         builder_name = BOB_THE_BUILDER_NAME
         self.builder = getUtility(IBuilderSet)[builder_name]
         self.builder.virtualized = True
-        self.interactor = BuilderInteractor()
 
     @property
     def vitals(self):
@@ -972,67 +960,56 @@
         scanner.logger.name = 'slave-scanner'
         return scanner
 
-    def test_ignores_nonvirtual(self):
-        # If the builder is nonvirtual make sure we return False.
-        self.builder.virtualized = False
-        d = self._getScanner().checkCancellation(
-            self.vitals, None, self.interactor)
-        return d.addCallback(self.assertFalse)
-
-    def test_ignores_no_buildqueue(self):
-        # If the builder has no buildqueue associated,
-        # make sure we return False.
-        buildqueue = self.builder.currentjob
-        buildqueue.reset()
-        d = self._getScanner().checkCancellation(
-            self.vitals, None, self.interactor)
-        return d.addCallback(self.assertFalse)
-
     def test_ignores_build_not_cancelling(self):
         # If the active build is not in a CANCELLING state, ignore it.
-        d = self._getScanner().checkCancellation(
-            self.vitals, None, self.interactor)
-        return d.addCallback(self.assertFalse)
-
-    @defer.inlineCallbacks
-    def test_cancelling_build_is_cancelled(self):
+        slave = BuildingSlave()
+        scanner = self._getScanner()
+        yield scanner.checkCancellation(self.vitals, slave)
+        self.assertEqual([], slave.call_log)
+
+    @defer.inlineCallbacks
+    def test_cancelling_build_is_aborted(self):
+        # The first time we see a CANCELLING build, we abort the slave.
+        slave = BuildingSlave()
+        self.builder.current_build.cancel()
+        scanner = self._getScanner()
+        yield scanner.checkCancellation(self.vitals, slave)
+        self.assertEqual(["abort"], slave.call_log)
+
+        # A further scan is a no-op, as we remember that we've already
+        # requested that the slave abort.
+        yield scanner.checkCancellation(self.vitals, slave)
+        self.assertEqual(["abort"], slave.call_log)
+
+    @defer.inlineCallbacks
+    def test_timed_out_cancel_errors(self):
         # If a BuildQueue is CANCELLING and the cancel timeout expires,
-        # make sure True is returned and the slave is dirty.
+        # an exception is raised so the normal scan error handler can
+        # finalise the build.
         slave = OkSlave()
-        self.builder.vm_host = "fake_vm_host"
-        buildqueue = self.builder.currentjob
-        build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(buildqueue)
+        build = self.builder.current_build
         build.cancel()
         clock = task.Clock()
         scanner = self._getScanner(clock=clock)
 
-        result = yield scanner.checkCancellation(
-            self.vitals, slave, self.interactor)
-        self.assertNotIn("resume", slave.call_log)
-        self.assertFalse(result)
+        yield scanner.checkCancellation(self.vitals, slave)
+        self.assertEqual(["abort"], slave.call_log)
         self.assertEqual(BuildStatus.CANCELLING, build.status)
 
         clock.advance(SlaveScanner.CANCEL_TIMEOUT)
-        result = yield scanner.checkCancellation(
-            self.vitals, slave, self.interactor)
-        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
-        self.assertTrue(result)
-        self.assertEqual(BuildStatus.CANCELLED, build.status)
+        with ExpectedException(
+                BuildSlaveFailure, "Timeout waiting for .* to cancel"):
+            yield scanner.checkCancellation(self.vitals, slave)
 
     @defer.inlineCallbacks
-    def test_lost_build_is_cancelled(self):
-        # If the builder reports a fault while attempting to abort it, then
-        # immediately resume the slave as if the cancel timeout had expired.
+    def test_failed_abort_errors(self):
+        # If the builder reports a fault while attempting to abort it,
+        # an exception is raised so the build can be finalised.
         slave = LostBuildingBrokenSlave()
-        self.builder.vm_host = "fake_vm_host"
-        buildqueue = self.builder.currentjob
-        build = getUtility(IBinaryPackageBuildSet).getByQueueEntry(buildqueue)
-        build.cancel()
-        result = yield self._getScanner().checkCancellation(
-            self.vitals, slave, self.interactor)
-        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
-        self.assertTrue(result)
-        self.assertEqual(BuildStatus.CANCELLED, build.status)
+        self.builder.current_build.cancel()
+        with ExpectedException(
+                xmlrpclib.Fault, "<Fault 8002: 'Could not abort'>"):
+            yield self._getScanner().checkCancellation(self.vitals, slave)
 
 
 class TestBuilddManager(TestCase):
@@ -1074,7 +1051,6 @@
 class TestFailureAssessments(TestCaseWithFactory):
 
     layer = ZopelessDatabaseLayer
-    run_tests_with = AsynchronousDeferredRunTest
 
     def setUp(self):
         TestCaseWithFactory.setUp(self)
@@ -1084,109 +1060,98 @@
         self.buildqueue.markAsBuilding(self.builder)
         self.slave = OkSlave()
 
-    def _assessFailureCounts(self, fail_notes, retry=True):
-        # Helper for assessFailureCounts boilerplate.
-        return assessFailureCounts(
+    def _recover_failure(self, fail_notes, retry=True):
+        # Helper for recover_failure boilerplate.
+        recover_failure(
             BufferLogger(), extract_vitals_from_db(self.builder), self.builder,
-            self.slave, BuilderInteractor(), retry, Exception(fail_notes))
+            retry, Exception(fail_notes))
 
-    @defer.inlineCallbacks
     def test_job_reset_threshold_with_retry(self):
         naked_build = removeSecurityProxy(self.build)
-        self.builder.failure_count = Builder.JOB_RESET_THRESHOLD - 1
-        naked_build.failure_count = Builder.JOB_RESET_THRESHOLD - 1
+        self.builder.failure_count = JOB_RESET_THRESHOLD - 1
+        naked_build.failure_count = JOB_RESET_THRESHOLD - 1
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIsNot(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.BUILDING)
 
         self.builder.failure_count += 1
         naked_build.failure_count += 1
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
 
-    @defer.inlineCallbacks
     def test_job_reset_threshold_no_retry(self):
         naked_build = removeSecurityProxy(self.build)
         self.builder.failure_count = 1
         naked_build.failure_count = 1
 
-        yield self._assessFailureCounts("failnotes", retry=False)
+        self._recover_failure("failnotes", retry=False)
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
 
-    @defer.inlineCallbacks
+    def test_reset_during_cancellation_cancels(self):
+        self.buildqueue.cancel()
+        self.assertEqual(BuildStatus.CANCELLING, self.build.status)
+
+        naked_build = removeSecurityProxy(self.build)
+        self.builder.failure_count = 1
+        naked_build.failure_count = 1
+
+        self._recover_failure("failnotes")
+        self.assertIs(None, self.builder.currentjob)
+        self.assertEqual(BuildStatus.CANCELLED, self.build.status)
+
     def test_job_failing_more_than_builder_fails_job(self):
         self.build.gotFailure()
         self.build.gotFailure()
         self.builder.gotFailure()
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.FAILEDTOBUILD)
         self.assertEqual(0, self.builder.failure_count)
 
-    @defer.inlineCallbacks
-    def test_virtual_builder_reset_thresholds(self):
-        self.builder.virtualized = True
-        self.builder.vm_host = 'foo'
+    def test_failure_during_cancellation_cancels(self):
+        self.buildqueue.cancel()
+        self.assertEqual(BuildStatus.CANCELLING, self.build.status)
+
+        self.build.gotFailure()
+        self.build.gotFailure()
+        self.builder.gotFailure()
+        self._recover_failure("failnotes")
+        self.assertIs(None, self.builder.currentjob)
+        self.assertEqual(BuildStatus.CANCELLED, self.build.status)
+
+    def test_bad_builder(self):
         self.builder.setCleanStatus(BuilderCleanStatus.CLEAN)
 
-        for failure_count in range(
-            Builder.RESET_THRESHOLD - 1,
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
-            self.builder.failure_count = failure_count
-            yield self._assessFailureCounts("failnotes")
-            self.assertIs(None, self.builder.currentjob)
-            self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-            # The builder should be CLEAN until it hits the
-            # RESET_THRESHOLD.
-            if failure_count % Builder.RESET_THRESHOLD != 0:
-                self.assertEqual(
-                    BuilderCleanStatus.CLEAN, self.builder.clean_status)
-            else:
-                self.assertEqual(
-                    BuilderCleanStatus.DIRTY, self.builder.clean_status)
-                self.builder.setCleanStatus(BuilderCleanStatus.CLEAN)
-            self.assertTrue(self.builder.builderok)
-
-        self.builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
-        yield self._assessFailureCounts("failnotes")
-        self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(
-            BuilderCleanStatus.CLEAN, self.builder.clean_status)
-        self.assertFalse(self.builder.builderok)
-        self.assertEqual("failnotes", self.builder.failnotes)
-
-    @defer.inlineCallbacks
-    def test_non_virtual_builder_reset_thresholds(self):
-        self.builder.virtualized = False
-        self.builder.failure_count = Builder.RESET_THRESHOLD - 1
-        yield self._assessFailureCounts("failnotes")
-        self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(0, self.slave.call_log.count('resume'))
+        # The first few failures of a bad builder just reset the job and
+        # mark the builder dirty. The next scan will reset a virtual
+        # builder or attempt to clean up a non-virtual builder.
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD - 1
+        self.assertIsNot(None, self.builder.currentjob)
+        self._recover_failure("failnotes")
+        self.assertIs(None, self.builder.currentjob)
+        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
         self.assertTrue(self.builder.builderok)
 
-        self.builder.failure_count = Builder.RESET_THRESHOLD
-        yield self._assessFailureCounts("failnotes")
+        # But if the builder continues to cause trouble, it will be
+        # disabled.
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD
+        self.buildqueue.markAsBuilding(self.builder)
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(0, self.slave.call_log.count('resume'))
+        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
         self.assertFalse(self.builder.builderok)
         self.assertEqual("failnotes", self.builder.failnotes)
 
-    @defer.inlineCallbacks
     def test_builder_failing_with_no_attached_job(self):
         self.buildqueue.reset()
-        self.builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertFalse(self.builder.builderok)
         self.assertEqual("failnotes", self.builder.failnotes)