
launchpad-reviewers team mailing list archive

[Merge] lp:~wgrant/launchpad/re-assessFailureCounts into lp:launchpad

 

William Grant has proposed merging lp:~wgrant/launchpad/re-assessFailureCounts into lp:launchpad.

Commit message:
Extract the complex logic from assessFailureCounts into a new judge_failure function.

Requested reviews:
  Launchpad code reviewers (launchpad-reviewers)

For more details, see:
https://code.launchpad.net/~wgrant/launchpad/re-assessFailureCounts/+merge/224159

This branch extracts the complex logic from assessFailureCounts into a new judge_failure function. judge_failure examines the given failure counts and returns, as a (builder action, job action) tuple, the actions to take for the builder and the job: reset (True), fail (False), or nothing (None). assessFailureCounts then executes those actions. This makes it easier to adjust the failure logic without duplicating or reshuffling the job reset and recovery code, and the tests are much nicer.
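
A few examples of the new contract, lifted from the tests added below (this assumes judge_failure and the Builder threshold constants are imported as in lib/lp/buildmaster/tests/test_manager.py; illustrative only):

    # The verdict is (builder action, job action):
    # True = reset, False = fail, None = leave alone.
    judge_failure(Builder.JOB_RESET_THRESHOLD - 1,
                  Builder.JOB_RESET_THRESHOLD - 1)  # (None, None): just retry
    judge_failure(Builder.RESET_THRESHOLD, 1)       # (True, True): reset builder, reset job
    judge_failure(1, 2)                             # (None, False): fail and dequeue the job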

resetOrFail also stopped returning a Deferred, and its result is True/False rather than True/None.
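
For callers that roughly means the following (a sketch only, not code from the branch; the old form is paraphrased from the previous manager.py):

    # Before: a Deferred firing True (slave resumed) or None.
    d = interactor.resetOrFail(vitals, slave, builder, logger, exception)
    d.addCallback(lambda value: value is not None)

    # After: a plain True/False returned directly.
    resumed = interactor.resetOrFail(vitals, slave, builder, logger, exception)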

This is all part of a total refactoring of buildd-manager error handling.
-- 
https://code.launchpad.net/~wgrant/launchpad/re-assessFailureCounts/+merge/224159
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~wgrant/launchpad/re-assessFailureCounts into lp:launchpad.
=== modified file 'lib/lp/buildmaster/interactor.py'
--- lib/lp/buildmaster/interactor.py	2014-06-20 13:08:19 +0000
+++ lib/lp/buildmaster/interactor.py	2014-06-23 15:32:50 +0000
@@ -394,8 +394,7 @@
 
         :param logger: The logger object to be used for logging.
         :param exception: An exception to be used for logging.
-        :return: A Deferred that fires after the virtual slave was resumed
-            or immediately if it's a non-virtual slave.
+        :return: True if the builder is to be resumed, False otherwise.
         """
         error_message = str(exception)
         if vitals.virtualized:
@@ -407,7 +406,7 @@
             if builder.clean_status != BuilderCleanStatus.DIRTY:
                 builder.setCleanStatus(BuilderCleanStatus.DIRTY)
                 transaction.commit()
-            return defer.succeed(True)
+            return True
         else:
             # XXX: This should really let the failure bubble up to the
             # scan() method that does the failure counting.
@@ -416,7 +415,7 @@
                 "Disabling builder: %s -- %s" % (vitals.url, error_message))
             builder.failBuilder(error_message)
             transaction.commit()
-        return defer.succeed(None)
+            return False
 
     @classmethod
     @defer.inlineCallbacks

=== modified file 'lib/lp/buildmaster/manager.py'
--- lib/lp/buildmaster/manager.py	2014-06-20 11:56:17 +0000
+++ lib/lp/buildmaster/manager.py	2014-06-23 15:32:50 +0000
@@ -131,6 +131,49 @@
         return (b for n, b in sorted(self.vitals_map.iteritems()))
 
 
+def judge_failure(builder_count, job_count, retry=True):
+    """Judge how to recover from a scan failure.
+
+    Assesses the failure counts of a builder and its current job, and
+    determines the best course of action for recovery.
+
+    :param builder_count: Count of consecutive failures of the builder.
+    :param job_count: Count of consecutive failures of the job.
+    :param retry: Whether to retry a few times without taking action.
+    :return: A tuple of (builder action, job action). True means reset,
+        False means fail, None means take no action.
+    """
+    if builder_count == job_count:
+        # We can't tell which is to blame. Retry a few times, and then
+        # reset the job so it can be retried elsewhere. If the job is at
+        # fault, it'll error on the next builder and fail out. If the
+        # builder is at fault, the job will work fine next time, and the
+        # builder will error on the next job and fail out.
+        if not retry or builder_count >= Builder.JOB_RESET_THRESHOLD:
+            return (None, True)
+    elif builder_count > job_count:
+        # The builder has failed more than the job, so the builder is at
+        # fault. We reset the job, and attempt to recover the builder.
+        # We retry a few times, then reset the builder. Then try a few
+        # more times, then reset it again. Then try yet again, after
+        # which we fail.
+        # XXX: Rescanning a virtual builder without resetting is stupid,
+        # as it will be reset anyway due to being dirty and idle.
+        if (builder_count >=
+                Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
+            return (False, True)
+        elif builder_count % Builder.RESET_THRESHOLD == 0:
+            return (True, True)
+        return (None, True)
+    else:
+        # The job has failed more than the builder. Fail it.
+        return (None, False)
+
+    # Just retry.
+    return (None, None)
+
+
 @defer.inlineCallbacks
 def assessFailureCounts(logger, vitals, builder, slave, interactor, retry,
                         exception):
@@ -139,65 +182,31 @@
     :return: A Deferred that fires either immediately or after a virtual
         slave has been reset.
     """
-    # builder.currentjob hides a complicated query, don't run it twice.
-    # See bug 623281 (Note that currentjob is a cachedproperty).
-
     del get_property_cache(builder).currentjob
-    current_job = builder.currentjob
-    if current_job is None:
-        job_failure_count = 0
-    else:
-        job_failure_count = current_job.specific_build.failure_count
-
-    if builder.failure_count == job_failure_count and current_job is not None:
-        # If the failure count for the builder is the same as the
-        # failure count for the job being built, then we cannot
-        # tell whether the job or the builder is at fault. We retry the
-        # scan a few times, but once we give up the best we can do is
-        # reset the job and hope it runs against a different builder,
-        # giving us a judgement on which is at fault.
-        if not retry or builder.failure_count >= Builder.JOB_RESET_THRESHOLD:
-            current_job.reset()
-            del get_property_cache(builder).currentjob
-        return
-
-    if builder.failure_count > job_failure_count:
-        # The builder has failed more than the jobs it's been
-        # running.
-
-        # Re-schedule the build if there is one.
-        if current_job is not None:
-            current_job.reset()
-
-        # We are a little more tolerant with failing builders than
-        # failing jobs because sometimes they get unresponsive due to
-        # human error, flaky networks etc.  We expect the builder to get
-        # better, whereas jobs are very unlikely to get better.
-        if builder.failure_count >= (
-                Builder.RESET_THRESHOLD * Builder.RESET_FAILURE_THRESHOLD):
-            # We've already tried resetting it enough times, so we have
-            # little choice but to give up.
-            builder.failBuilder(str(exception))
-        elif builder.failure_count % Builder.RESET_THRESHOLD == 0:
-            # The builder is dead, but in the virtual case it might be worth
-            # resetting it.
-            yield interactor.resetOrFail(
-                vitals, slave, builder, logger, exception)
-    else:
-        # The job is the culprit!  Override its status to 'failed'
-        # to make sure it won't get automatically dispatched again,
-        # and remove the buildqueue request.  The failure should
-        # have already caused any relevant slave data to be stored
-        # on the build record so don't worry about that here.
-        builder.resetFailureCount()
-        build_job = current_job.specific_build
-        build_job.updateStatus(BuildStatus.FAILEDTOBUILD)
-        builder.currentjob.destroySelf()
-
-        # N.B. We could try and call _handleStatus_PACKAGEFAIL here
-        # but that would cause us to query the slave for its status
-        # again, and if the slave is non-responsive it holds up the
-        # next buildd scan.
+    job = builder.currentjob
+    builder_action, job_action = judge_failure(
+        builder.failure_count, job.specific_build.failure_count if job else 0,
+        retry=retry)
+
+    if job is not None:
+        if job_action == False:
+            # Fail and dequeue the job.
+            builder.resetFailureCount()
+            job.specific_build.updateStatus(BuildStatus.FAILEDTOBUILD)
+            job.destroySelf()
+        elif job_action == True:
+            # Reset the job so it will be retried elsewhere.
+            job.reset()
+
+    if builder_action == False:
+        # We've already tried resetting it enough times, so we have
+        # little choice but to give up.
+        builder.failBuilder(str(exception))
+    elif builder_action == True:
+        # The builder is dead, but in the virtual case it might be worth
+        # resetting it.
+        yield interactor.resetOrFail(
+            vitals, slave, builder, logger, exception)
     del get_property_cache(builder).currentjob
 
 
@@ -374,11 +383,10 @@
             self.date_cancel = None
             vitals.build_queue.markAsCancelled()
             transaction.commit()
-            value = yield interactor.resetOrFail(
+            resumed = yield interactor.resetOrFail(
                 vitals, slave, self.builder_factory[vitals.name], self.logger,
                 e)
-            # value is not None if we resumed a slave host.
-            defer.returnValue(value is not None)
+            defer.returnValue(resumed)
 
     def getExpectedCookie(self, vitals):
         """Return the build cookie expected to be held by the slave.

=== modified file 'lib/lp/buildmaster/tests/test_interactor.py'
--- lib/lp/buildmaster/tests/test_interactor.py	2014-06-20 13:08:19 +0000
+++ lib/lp/buildmaster/tests/test_interactor.py	2014-06-23 15:32:50 +0000
@@ -161,12 +161,12 @@
         d = self.resumeSlaveHost(MockBuilder(virtualized=True, vm_host="pop"))
         return assert_fails_with(d, CannotResumeHost)
 
-    @defer.inlineCallbacks
     def test_resetOrFail_nonvirtual(self):
         builder = MockBuilder(virtualized=False, builderok=True)
         vitals = extract_vitals_from_db(builder)
-        yield BuilderInteractor().resetOrFail(
-            vitals, None, builder, DevNullLogger(), Exception())
+        self.assertFalse(
+            BuilderInteractor().resetOrFail(
+                vitals, None, builder, DevNullLogger(), Exception()))
         self.assertFalse(builder.builderok)
 
     def test_makeSlaveFromVitals(self):

=== modified file 'lib/lp/buildmaster/tests/test_manager.py'
--- lib/lp/buildmaster/tests/test_manager.py	2014-06-20 13:08:19 +0000
+++ lib/lp/buildmaster/tests/test_manager.py	2014-06-23 15:32:50 +0000
@@ -47,6 +47,7 @@
     assessFailureCounts,
     BuilddManager,
     BuilderFactory,
+    judge_failure,
     NewBuildersScanner,
     PrefetchedBuilderFactory,
     SlaveScanner,
@@ -862,6 +863,57 @@
         assertCounts((0, 2, 2))
 
 
+class TestJudgeFailure(TestCase):
+
+    def test_same_count_below_threshold(self):
+        # A few consecutive failures aren't any cause for alarm, as it
+        # could just be a network glitch.
+        self.assertEqual(
+            (None, None),
+            judge_failure(
+                Builder.JOB_RESET_THRESHOLD - 1,
+                Builder.JOB_RESET_THRESHOLD - 1))
+
+    def test_same_count_exceeding_threshold(self):
+        # Several consecutive failures suggest that something might be
+        # up. The job is retried elsewhere.
+        self.assertEqual(
+            (None, True),
+            judge_failure(
+                Builder.JOB_RESET_THRESHOLD, Builder.JOB_RESET_THRESHOLD))
+
+    def test_bad_builder_below_threshold(self):
+        self.assertEqual(
+            (None, True),
+            judge_failure(Builder.RESET_THRESHOLD - 1, 1))
+
+    def test_bad_builder_at_reset_threshold(self):
+        self.assertEqual(
+            (True, True),
+            judge_failure(Builder.RESET_THRESHOLD, 1))
+
+    def test_bad_builder_above_reset_threshold(self):
+        self.assertEqual(
+            (None, True),
+            judge_failure(
+                Builder.RESET_THRESHOLD + 1, Builder.RESET_THRESHOLD))
+
+    def test_bad_builder_second_reset(self):
+        self.assertEqual(
+            (True, True),
+            judge_failure(Builder.RESET_THRESHOLD * 2, 1))
+
+    def test_bad_builder_gives_up(self):
+        self.assertEqual(
+            (False, True),
+            judge_failure(Builder.RESET_THRESHOLD * 3, 1))
+
+    def test_bad_job_fails(self):
+        self.assertEqual(
+            (None, False),
+            judge_failure(1, 2))
+
+
 class TestCancellationChecking(TestCaseWithFactory):
     """Unit tests for the checkCancellation method."""
 

