← Back to team overview

canonical-ubuntu-qa team mailing list archive

[Merge] ~andersson123/autopkgtest-cloud:fix-systemd-stop-restart-losing-jobs into autopkgtest-cloud:master

 

Tim Andersson has proposed merging ~andersson123/autopkgtest-cloud:fix-systemd-stop-restart-losing-jobs into autopkgtest-cloud:master.

Commit message:
fix: worker: actually save messages when service stopped by systemd
    
    Previously, we had an iteration of this functionality which instead
    utilised checking the exit code of the autopkgtest subprocess
    for a -15 code, utilising the mechanism detailed in [1].
    
    This was brittle - in the case the restart was executed at the time
    before the VM for the test was in BUILD state, the autopkgtest
    subprocess would exit with code 1 and the test request would be
    lost.
    
    This commit amends the issue by instead utilising the signal
    handlers in the worker code.
    
    This commit also re-introduces the documentation lost by the
    revert commit prior to this one.
    
    Fixes LP: #2067714
    
    [1] https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess.returncode

Requested reviews:
  Canonical's Ubuntu QA (canonical-ubuntu-qa)
Related bugs:
  Bug #2067714 in Auto Package Testing: "systemctl restart autopkgtest@*.service still looses jobs"
  https://bugs.launchpad.net/auto-package-testing/+bug/2067714

For more details, see:
https://code.launchpad.net/~andersson123/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/466850

The latest change I made for this functionality was unfortunately, bunk. It didn't work as intended in the case that the test was systemctl restart'ed before the VM was in at least BUILD state. 

I've tested these new changes under the following scenarios:
- Before the VM is even in BUILD state
- When the VM is in BUILD state
- When the VM is in ACTIVE state

I believe this covers all the possible states we can encounter - aside from after ACTIVE state and the VM is already killed - though this is the same as prior to the VM being spawned.
-- 
Your team Canonical's Ubuntu QA is requested to review the proposed merge of ~andersson123/autopkgtest-cloud:fix-systemd-stop-restart-losing-jobs into autopkgtest-cloud:master.
diff --git a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
index 2467a8b..32c8661 100755
--- a/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
+++ b/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/worker/worker
@@ -1367,33 +1367,26 @@ def request(msg):
                     # tests can be requested to be killed with:
                     # kill -15 $pid
                     logging.warning(
-                        "autopkgtest exited with code: %i",
+                        "autopkgtest exited with code: %i\nack-ing message: %s",
                         code,
+                        body.encode(),
                     )
                     running_test = False
                     # we ack the message so it doesn't go back in the queue
-                    if code == -15:
+                    if exit_requested is not None:
                         logging.info(
-                            "autopkgtest has been killed with SIGTERM, indicating a restart or stop, requeue-ing message %s"
-                            % body
+                            "An exit has been requested via systemd, placing message %s back in queue",
+                            body.encode(),
                         )
                         msg.channel.basic_reject(
                             msg.delivery_tag, requeue=True
                         )
-                    elif code == -9:
-                        logging.info(
-                            "autopkgtest has been killed intentionally, removing message %s from the queue"
-                            % body
-                        )
-                        msg.channel.basic_ack(msg.delivery_tag)
                     else:
-                        logging.info(
-                            "autopkgtest has hit an obscure failure mode, removing message %s from the queue"
-                            % body
-                        )
-                        msg.channel.basic_reject(
-                            msg.delivery_tag, requeue=True
+                        logging.warning(
+                            "autopkgtest failure not requested via systemd, removing message %s from queue",
+                            body.encode(),
                         )
+                        msg.channel.basic_ack(msg.delivery_tag)
                     logging.info(
                         "Killing openstack server with uuid %s",
                         test_uuid,
diff --git a/docs/administration.rst b/docs/administration.rst
index 2171554..5abbbe4 100644
--- a/docs/administration.rst
+++ b/docs/administration.rst
@@ -486,8 +486,10 @@ In order to kill a currently running test, grab the test uuid. This can be seen 
   ps aux | grep runner | grep $uuid
   # grab the PID from the process
   # If you want to kill the test and remove the test request from the queue:
+  # kill -9 sends SIGKILL, which isn't caught by signal handlers in the worker code.
   kill -9 $pid
   # If you want to kill the test and preserve the test request in the queue:
+  # kill -15 sends SIGTERM, which is caught by the signal handlers in the worker code.
   kill -15 $pid
 
 This will kill the autopkgtest process, and then the worker will `ack` the test request