← Back to team overview

launchpad-reviewers team mailing list archive

[Merge] lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad

 

William Grant has proposed merging lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad with lp:~wgrant/launchpad/checkCancellation-nicify as a prerequisite.

Commit message:
Clean up buildd-manager failure handling, and fix it to only attempt to recover virtual builders by resetting up to four times, not 14.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~wgrant/launchpad/resetOrFail-pointless/+merge/224262

This branch further simplifies buildd-manager error handling. The only significant change is that virtual builders will now fail after four resets instead of 14.

In the old days (like, say, a week ago), assessFailureCounts would retry a scan five times, then reset a virtual builder or fail a non-virtual one. In the virtual case, it would reset after five and 10 failures, and only fail after 15. But Since SlaveScanner.scan() now respects Builder.clean_status by resetting DIRTY virtual builders at the start of a scan, assessFailureCounts actually ends up implicitly resetting the builder four times, then explicitly once, then implicitly another four times, explicitly once, then implicitly another four, then finally failing the builder on the 15th scan.

Resetting 14 times is probably slightly excessive, so I've simplified the logic to just retry the scan four times and then fail, regardless of virtness. In the virt case this try cause four resets before failure, while the non-virt case will, as always, just have the four chances to magically come good.

This actually simplifies things considerably. assessFailureCounts and _scanFailed no longer return Deferreds, and resetOrFail dies entirely. I also renamed assessFailureCounts to recover_failure, to meet our non-member function naming style and also be more accurate.
-- 
https://code.launchpad.net/~wgrant/launchpad/resetOrFail-pointless/+merge/224262
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~wgrant/launchpad/resetOrFail-pointless into lp:launchpad.
=== modified file 'lib/lp/buildmaster/interactor.py'
--- lib/lp/buildmaster/interactor.py	2014-06-23 14:52:15 +0000
+++ lib/lp/buildmaster/interactor.py	2014-06-24 10:26:31 +0000
@@ -380,45 +380,6 @@
         yield behaviour.dispatchBuildToSlave(build_queue_item.id, logger)
 
     @classmethod
-    def resetOrFail(cls, vitals, slave, builder, logger, exception):
-        """Handle "confirmed" build slave failures.
-
-        Call this when there have been multiple failures that are not just
-        the fault of failing jobs, or when the builder has entered an
-        ABORTED state without having been asked to do so.
-
-        In case of a virtualized builder we ensure it's dirty so the
-        next scan will reset it.
-
-        A non-virtualized builder will be failed straightaway (using
-        `failBuilder`).
-
-        :param logger: The logger object to be used for logging.
-        :param exception: An exception to be used for logging.
-        :return: True if the builder is to be resumed, None otherwise.
-        """
-        error_message = str(exception)
-        if vitals.virtualized:
-            # Virtualized/PPA builder: mark it dirty so the next
-            # scan will attempt a reset.
-            logger.warn(
-                "Dirtying builder: %s -- %s" % (vitals.url, error_message),
-                exc_info=True)
-            if builder.clean_status != BuilderCleanStatus.DIRTY:
-                builder.setCleanStatus(BuilderCleanStatus.DIRTY)
-                transaction.commit()
-            return True
-        else:
-            # XXX: This should really let the failure bubble up to the
-            # scan() method that does the failure counting.
-            # Mark builder as 'failed'.
-            logger.warn(
-                "Disabling builder: %s -- %s" % (vitals.url, error_message))
-            builder.failBuilder(error_message)
-            transaction.commit()
-            return False
-
-    @classmethod
     @defer.inlineCallbacks
     def findAndStartJob(cls, vitals, builder, slave):
         """Find a job to run and send it to the buildd slave.

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2014-06-24 10:26:31 +0000
+++ lib/lp/buildmaster/manager.py	2014-06-24 10:26:31 +0000
@@ -52,6 +52,15 @@
 BUILDD_MANAGER_LOG_NAME = "slave-scanner"
 
 
+# The number of times a builder can consecutively fail before we
+# reset its current job.
+JOB_RESET_THRESHOLD = 3
+
+# The number of times a builder can consecutively fail before we
+# mark it builderok=False.
+BUILDER_FAILURE_THRESHOLD = 5
+
+
 class BuilderFactory:
     """A dumb builder factory that just talks to the DB."""
 
@@ -156,22 +165,19 @@
         # fault, it'll error on the next builder and fail out. If the
         # builder is at fault, the job will work fine next time, and the
         # builder will error on the next job and fail out.
-        if not retry or builder_count >= Builder.JOB_RESET_THRESHOLD:
+        if not retry or builder_count >= JOB_RESET_THRESHOLD:
             return (None, True)
     elif builder_count > job_count:
         # The builder has failed more than the job, so the builder is at
-        # fault. We reset the job, and attempt to recover the builder.
-        # We retry a few times, then reset the builder. Then try a few
-        # more times, then reset it again. Then try yet again, after
-        # which we fail.
-        # XXX: Rescanning a virtual builder without resetting is stupid,
-        # as it will be reset anyway due to being dirty and idle.
-        if (builder_count >=
-                Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
+        # fault. We reset the job and attempt to recover the builder.
+        if builder_count < BUILDER_FAILURE_THRESHOLD:
+            # Let's dirty the builder and give it a few cycles to
+            # recover. Since it's dirty and idle, this will
+            # automatically attempt a reset if virtual.
+            return (True, True)
+        else:
+            # We've retried too many times, so fail the builder.
             return (False, True)
-        elif builder_count % Builder.RESET_THRESHOLD == 0:
-            return (True, True)
-        return (None, True)
     else:
         # The job has failed more than the builder. Fail it.
         return (None, False)
@@ -180,20 +186,17 @@
     return (None, None)
 
 
-@defer.inlineCallbacks
-def assessFailureCounts(logger, vitals, builder, slave, interactor, retry,
-                        exception):
-    """View builder/job failure_count and work out which needs to die.
-
-    :return: A Deferred that fires either immediately or after a virtual
-        slave has been reset.
-    """
+def recover_failure(logger, vitals, builder, retry, exception):
+    """Recover from a scan failure by slapping the builder or job."""
     del get_property_cache(builder).currentjob
     job = builder.currentjob
 
     # If a job is being cancelled we won't bother retrying a failure.
     # Just mark it as cancelled and clear the builder for normal cleanup.
     cancelling = job is not None and job.status == BuildQueueStatus.CANCELLING
+
+    # judge_failure decides who is guilty and their sentences. We're
+    # just the executioner.
     builder_action, job_action = judge_failure(
         builder.failure_count, job.specific_build.failure_count if job else 0,
         exception, retry=retry and not cancelling)
@@ -220,11 +223,10 @@
         # little choice but to give up.
         builder.failBuilder(str(exception))
     elif builder_action == True:
-        # The builder is dead, but in the virtual case it might be worth
-        # resetting it.
-        yield interactor.resetOrFail(
-            vitals, slave, builder, logger, exception)
-    del get_property_cache(builder).currentjob
+        # Dirty the builder to attempt recovery. In the virtual case,
+        # the dirty idleness will cause a reset, giving us a good chance
+        # of recovery.
+        builder.setCleanStatus(BuilderCleanStatus.DIRTY)
 
 
 class SlaveScanner:
@@ -304,16 +306,12 @@
     def _updateDateScanned(self, ignored):
         self.date_scanned = datetime.datetime.utcnow()
 
-    @defer.inlineCallbacks
     def _scanFailed(self, retry, failure):
         """Deal with failures encountered during the scan cycle.
 
         1. Print the error in the log
         2. Increment and assess failure counts on the builder and job.
            If asked to retry, a single failure may not be considered fatal.
-
-        :return: A Deferred that fires either immediately or after a virtual
-            slave has been reset.
         """
         # Make sure that pending database updates are removed as it
         # could leave the database in an inconsistent state (e.g. The
@@ -338,9 +336,7 @@
         builder = self.builder_factory[self.builder_name]
         try:
             builder.handleFailure(self.logger)
-            yield assessFailureCounts(
-                self.logger, vitals, builder, self.slave_factory(vitals),
-                self.interactor_factory(), retry, failure.value)
+            recover_failure(self.logger, vitals, builder, retry, failure.value)
             transaction.commit()
         except Exception:
             # Catastrophic code failure! Not much we can do.

=== modified file 'lib/lp/buildmaster/model/builder.py'
--- lib/lp/buildmaster/model/builder.py	2014-06-20 11:52:32 +0000
+++ lib/lp/buildmaster/model/builder.py	2014-06-24 10:26:31 +0000
@@ -107,19 +107,6 @@
     vm_reset_protocol = EnumCol(enum=BuilderResetProtocol)
     date_clean_status_changed = UtcDateTimeCol()
 
-    # The number of times a builder can consecutively fail before we
-    # reset its current job.
-    JOB_RESET_THRESHOLD = 3
-
-    # The number of times a builder can consecutively fail before we try
-    # resetting it (if virtual) or marking it builderok=False (if not).
-    RESET_THRESHOLD = 5
-
-    # The number of times a virtual builder can reach its reset threshold
-    # due to consecutive failures before we give up and mark it
-    # builderok=False.
-    RESET_FAILURE_THRESHOLD = 3
-
     def _getBuilderok(self):
         return self._builderok
 

=== modified file 'lib/lp/buildmaster/tests/test_interactor.py'
--- lib/lp/buildmaster/tests/test_interactor.py	2014-06-23 14:52:15 +0000
+++ lib/lp/buildmaster/tests/test_interactor.py	2014-06-24 10:26:31 +0000
@@ -161,14 +161,6 @@
         d = self.resumeSlaveHost(MockBuilder(virtualized=True, vm_host="pop"))
         return assert_fails_with(d, CannotResumeHost)
 
-    def test_resetOrFail_nonvirtual(self):
-        builder = MockBuilder(virtualized=False, builderok=True)
-        vitals = extract_vitals_from_db(builder)
-        self.assertFalse(
-            BuilderInteractor().resetOrFail(
-                vitals, None, builder, DevNullLogger(), Exception()))
-        self.assertFalse(builder.builderok)
-
     def test_makeSlaveFromVitals(self):
         # Builder.slave is a BuilderSlave that points at the actual Builder.
         # The Builder is only ever used in scripts that run outside of the

=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
--- lib/lp/buildmaster/tests/test_manager.py	2014-06-24 10:26:31 +0000
+++ lib/lp/buildmaster/tests/test_manager.py	2014-06-24 10:26:31 +0000
@@ -45,15 +45,16 @@
     )
 from lp.buildmaster.interfaces.buildqueue import IBuildQueueSet
 from lp.buildmaster.manager import (
-    assessFailureCounts,
     BuilddManager,
+    BUILDER_FAILURE_THRESHOLD,
     BuilderFactory,
+    JOB_RESET_THRESHOLD,
     judge_failure,
     NewBuildersScanner,
     PrefetchedBuilderFactory,
+    recover_failure,
     SlaveScanner,
     )
-from lp.buildmaster.model.builder import Builder
 from lp.buildmaster.tests.harness import BuilddManagerTestSetup
 from lp.buildmaster.tests.mock_slaves import (
     BrokenSlave,
@@ -376,7 +377,7 @@
         scanner = self._getScanner()
         scanner.scan = failing_scan
         from lp.buildmaster import manager as manager_module
-        self.patch(manager_module, 'assessFailureCounts', FakeMethod())
+        self.patch(manager_module, 'recover_failure', FakeMethod())
         builder = getUtility(IBuilderSet)[scanner.builder_name]
 
         builder.failure_count = builder_count
@@ -397,7 +398,7 @@
         self.assertEqual(
             expected_job_count,
             builder.current_build.failure_count)
-        self.assertEqual(1, manager_module.assessFailureCounts.call_count)
+        self.assertEqual(1, manager_module.recover_failure.call_count)
 
     def test_scan_first_fail(self):
         # The first failure of a job should result in the failure_count
@@ -427,8 +428,7 @@
         scanner = self._getScanner()
         scanner.scan = failing_scan
         builder = getUtility(IBuilderSet)[scanner.builder_name]
-        builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
+        builder.failure_count = BUILDER_FAILURE_THRESHOLD
         builder.currentjob.reset()
         transaction.commit()
 
@@ -893,8 +893,8 @@
         self.assertEqual(
             (None, None),
             judge_failure(
-                Builder.JOB_RESET_THRESHOLD - 1,
-                Builder.JOB_RESET_THRESHOLD - 1, Exception()))
+                JOB_RESET_THRESHOLD - 1, JOB_RESET_THRESHOLD - 1,
+                Exception()))
 
     def test_same_count_exceeding_threshold(self):
         # Several consecutive failures suggest that something might be
@@ -902,42 +902,30 @@
         self.assertEqual(
             (None, True),
             judge_failure(
-                Builder.JOB_RESET_THRESHOLD, Builder.JOB_RESET_THRESHOLD,
-                Exception()))
+                JOB_RESET_THRESHOLD, JOB_RESET_THRESHOLD, Exception()))
 
     def test_same_count_no_retries(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(
-                Builder.JOB_RESET_THRESHOLD - 1,
-                Builder.JOB_RESET_THRESHOLD - 1, Exception(), retry=False))
-
-    def test_bad_builder_below_threshold(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(Builder.RESET_THRESHOLD - 1, 1, Exception()))
-
-    def test_bad_builder_at_reset_threshold(self):
-        self.assertEqual(
-            (True, True),
-            judge_failure(Builder.RESET_THRESHOLD, 1, Exception()))
-
-    def test_bad_builder_above_reset_threshold(self):
-        self.assertEqual(
-            (None, True),
-            judge_failure(
-                Builder.RESET_THRESHOLD + 1, Builder.RESET_THRESHOLD,
-                Exception()))
-
-    def test_bad_builder_second_reset(self):
-        self.assertEqual(
-            (True, True),
-            judge_failure(Builder.RESET_THRESHOLD * 2, 1, Exception()))
+        # A single failure of both causes a job reset if retries are
+        # forbidden.
+        self.assertEqual(
+            (None, True),
+            judge_failure(
+                JOB_RESET_THRESHOLD - 1, JOB_RESET_THRESHOLD - 1, Exception(),
+                retry=False))
+
+    def test_bad_builder(self):
+        # A bad builder resets its job and dirties itself. The next scan
+        # will do what it can to recover it (resetting in the virtual
+        # case, or just retrying for non-virts).
+        self.assertEqual(
+            (True, True),
+            judge_failure(BUILDER_FAILURE_THRESHOLD - 1, 1, Exception()))
 
     def test_bad_builder_gives_up(self):
+        # A persistently bad builder resets its job and fails itself.
         self.assertEqual(
             (False, True),
-            judge_failure(Builder.RESET_THRESHOLD * 3, 1, Exception()))
+            judge_failure(BUILDER_FAILURE_THRESHOLD, 1, Exception()))
 
     def test_bad_job_fails(self):
         self.assertEqual(
@@ -1063,7 +1051,6 @@
 class TestFailureAssessments(TestCaseWithFactory):
 
     layer = ZopelessDatabaseLayer
-    run_tests_with = AsynchronousDeferredRunTest
 
     def setUp(self):
         TestCaseWithFactory.setUp(self)
@@ -1073,40 +1060,37 @@
         self.buildqueue.markAsBuilding(self.builder)
         self.slave = OkSlave()
 
-    def _assessFailureCounts(self, fail_notes, retry=True):
-        # Helper for assessFailureCounts boilerplate.
-        return assessFailureCounts(
+    def _recover_failure(self, fail_notes, retry=True):
+        # Helper for recover_failure boilerplate.
+        recover_failure(
             BufferLogger(), extract_vitals_from_db(self.builder), self.builder,
-            self.slave, BuilderInteractor(), retry, Exception(fail_notes))
+            retry, Exception(fail_notes))
 
-    @defer.inlineCallbacks
     def test_job_reset_threshold_with_retry(self):
         naked_build = removeSecurityProxy(self.build)
-        self.builder.failure_count = Builder.JOB_RESET_THRESHOLD - 1
-        naked_build.failure_count = Builder.JOB_RESET_THRESHOLD - 1
+        self.builder.failure_count = JOB_RESET_THRESHOLD - 1
+        naked_build.failure_count = JOB_RESET_THRESHOLD - 1
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIsNot(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.BUILDING)
 
         self.builder.failure_count += 1
         naked_build.failure_count += 1
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
 
-    @defer.inlineCallbacks
     def test_job_reset_threshold_no_retry(self):
         naked_build = removeSecurityProxy(self.build)
         self.builder.failure_count = 1
         naked_build.failure_count = 1
 
-        yield self._assessFailureCounts("failnotes", retry=False)
+        self._recover_failure("failnotes", retry=False)
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
 
-    @defer.inlineCallbacks
     def test_reset_during_cancellation_cancels(self):
         self.buildqueue.cancel()
         self.assertEqual(BuildStatus.CANCELLING, self.build.status)
@@ -1115,22 +1099,20 @@
         self.builder.failure_count = 1
         naked_build.failure_count = 1
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(BuildStatus.CANCELLED, self.build.status)
 
-    @defer.inlineCallbacks
     def test_job_failing_more_than_builder_fails_job(self):
         self.build.gotFailure()
         self.build.gotFailure()
         self.builder.gotFailure()
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(self.build.status, BuildStatus.FAILEDTOBUILD)
         self.assertEqual(0, self.builder.failure_count)
 
-    @defer.inlineCallbacks
     def test_failure_during_cancellation_cancels(self):
         self.buildqueue.cancel()
         self.assertEqual(BuildStatus.CANCELLING, self.build.status)
@@ -1138,69 +1120,38 @@
         self.build.gotFailure()
         self.build.gotFailure()
         self.builder.gotFailure()
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
         self.assertEqual(BuildStatus.CANCELLED, self.build.status)
 
-    @defer.inlineCallbacks
-    def test_virtual_builder_reset_thresholds(self):
-        self.builder.virtualized = True
-        self.builder.vm_host = 'foo'
+    def test_bad_builder(self):
         self.builder.setCleanStatus(BuilderCleanStatus.CLEAN)
 
-        for failure_count in range(
-            Builder.RESET_THRESHOLD - 1,
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
-            self.builder.failure_count = failure_count
-            yield self._assessFailureCounts("failnotes")
-            self.assertIs(None, self.builder.currentjob)
-            self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-            # The builder should be CLEAN until it hits the
-            # RESET_THRESHOLD.
-            if failure_count % Builder.RESET_THRESHOLD != 0:
-                self.assertEqual(
-                    BuilderCleanStatus.CLEAN, self.builder.clean_status)
-            else:
-                self.assertEqual(
-                    BuilderCleanStatus.DIRTY, self.builder.clean_status)
-                self.builder.setCleanStatus(BuilderCleanStatus.CLEAN)
-            self.assertTrue(self.builder.builderok)
-
-        self.builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
-        yield self._assessFailureCounts("failnotes")
-        self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(
-            BuilderCleanStatus.CLEAN, self.builder.clean_status)
-        self.assertFalse(self.builder.builderok)
-        self.assertEqual("failnotes", self.builder.failnotes)
-
-    @defer.inlineCallbacks
-    def test_non_virtual_builder_reset_thresholds(self):
-        self.builder.virtualized = False
-        self.builder.failure_count = Builder.RESET_THRESHOLD - 1
-        yield self._assessFailureCounts("failnotes")
-        self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(0, self.slave.call_log.count('resume'))
+        # The first few failures of a bad builder just reset the job and
+        # mark the builder dirty. The next scan will reset a virtual
+        # builder or attempt to clean up a non-virtual builder.
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD - 1
+        self.assertIsNot(None, self.builder.currentjob)
+        self._recover_failure("failnotes")
+        self.assertIs(None, self.builder.currentjob)
+        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
         self.assertTrue(self.builder.builderok)
 
-        self.builder.failure_count = Builder.RESET_THRESHOLD
-        yield self._assessFailureCounts("failnotes")
+        # But if the builder continues to cause trouble, it will be
+        # disabled.
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD
+        self.buildqueue.markAsBuilding(self.builder)
+        self._recover_failure("failnotes")
         self.assertIs(None, self.builder.currentjob)
-        self.assertEqual(self.build.status, BuildStatus.NEEDSBUILD)
-        self.assertEqual(0, self.slave.call_log.count('resume'))
+        self.assertEqual(BuilderCleanStatus.DIRTY, self.builder.clean_status)
         self.assertFalse(self.builder.builderok)
         self.assertEqual("failnotes", self.builder.failnotes)
 
-    @defer.inlineCallbacks
     def test_builder_failing_with_no_attached_job(self):
         self.buildqueue.reset()
-        self.builder.failure_count = (
-            Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD)
+        self.builder.failure_count = BUILDER_FAILURE_THRESHOLD
 
-        yield self._assessFailureCounts("failnotes")
+        self._recover_failure("failnotes")
         self.assertFalse(self.builder.builderok)
         self.assertEqual("failnotes", self.builder.failnotes)
 


Follow ups