canonical-ubuntu-qa team mailing list archive

[Merge] ~andersson123/autopkgtest-cloud:missing-tests into autopkgtest-cloud:master

To: mp+477440@xxxxxxxxxxxxxxxxxx
From: Tim Andersson <mp+477440@xxxxxxxxxxxxxxxxxx>
Date: Thu, 28 Nov 2024 14:42:21 -0000
Reply-to: mp+477440@xxxxxxxxxxxxxxxxxx
Sender: noreply@xxxxxxxxxxxxx

Tim Andersson has proposed merging ~andersson123/autopkgtest-cloud:missing-tests into autopkgtest-cloud:master.

Requested reviews:
  Canonical's Ubuntu QA (canonical-ubuntu-qa)

For more details, see:
https://code.launchpad.net/~andersson123/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/477440

Hopefully this fixes the test requests we have been losing recently :/
-- 
Your team Canonical's Ubuntu QA is requested to review the proposed merge of ~andersson123/autopkgtest-cloud:missing-tests into autopkgtest-cloud:master.
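
The decision changed by the diff below can be sketched in isolation. This is a minimal sketch under stated assumptions, not the worker's actual code: FakeChannel, handle_abnormal_exit, stop_requested and MAX_RETRIES are hypothetical names, the condition guarding the requeue branch is not visible in the hunk and is assumed here to be a single boolean, and the call to kill_openstack_server is left out.

```python
import logging
from dataclasses import dataclass, field

# Assumption: 3 is the retry limit, mirroring "retry = 3" in the diff.
MAX_RETRIES = 3


@dataclass
class FakeChannel:
    """Hypothetical stand-in that records what would be sent to the broker."""

    calls: list = field(default_factory=list)

    def basic_reject(self, delivery_tag, requeue):
        self.calls.append(("reject", delivery_tag, requeue))

    def basic_ack(self, delivery_tag):
        self.calls.append(("ack", delivery_tag))


def handle_abnormal_exit(channel, delivery_tag, code, stop_requested, retry):
    """Decide what happens to a test request after an abnormal exit.

    Returns (finished, retry). finished is True when the caller should
    return right away; otherwise it should carry on and record the result.
    """
    if stop_requested:
        # Put the request back in the queue so it is picked up again.
        channel.basic_reject(delivery_tag, requeue=True)
        return True, retry

    # Unknown exit code: remove the request from the queue, but do not
    # drop it silently. Exhausting the retries lets the caller fall
    # through and record the run as a real failure.
    logging.warning(
        "autopkgtest failed with unknown code %i, counting as a real failure",
        code,
    )
    channel.basic_ack(delivery_tag)
    return False, MAX_RETRIES


if __name__ == "__main__":
    channel = FakeChannel()
    print(handle_abnormal_exit(channel, 1, 99, stop_requested=False, retry=0))
    print(channel.calls)
```

With the change, only the requeue branch returns early. The other branch acknowledges the message and falls through, so the failure is recorded like any other test result instead of disappearing.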

diff --git a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
index c281b0f..9483f8a 100755
--- a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
+++ b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
@@ -1401,16 +1401,28 @@ def request(msg):
                         msg.channel.basic_reject(
                             msg.delivery_tag, requeue=True
                         )
+                        kill_openstack_server(test_uuid)
+                        # return here so the worker can go back to listening for
+                        # test requests
+                        return
                     else:
+                        # Tim Andersson:
+                        # As of 2024-11-28 we have been losing test requests, and this
+                        # block was the cause: when autopkgtest exited with an abnormal
+                        # exit code, the block assumed an admin had intentionally killed
+                        # the test and removed the message from the queue. Before that
+                        # logic existed, the request went back into the queue and the
+                        # test looped forever. The best option, I believe, is to count
+                        # this as a "real" failure: the result goes into the database and
+                        # the log is available in swift, so a.u.c admins can investigate.
+                        # Setting retry to 3 stops this whole block from executing again.
                         logging.warning(
-                            "autopkgtest failure not requested via systemd, removing message %s from queue",
+                            "autopkgtest has failed with an unknown code (%i), removing message %s from queue and counting as a real failure so admins can more easily debug the issue.",
+                            code,
                             body.encode(),
                         )
                         msg.channel.basic_ack(msg.delivery_tag)
-                    kill_openstack_server(test_uuid)
-                    # return here so the worker can go back to listening for
-                    # test requests
-                    return
+                        retry = 3
             else:
                 if num_failures >= 3:
                     logging.warning(