canonical-ubuntu-qa team mailing list archive

[Bug 1988080] [NEW] cloud-worker-maintenance can hang

Public bug reported:

The cloud-worker-maintenance job appeared to be stuck with the following
in journalctl:

Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3162016]: Error: Stopping the instance failed: websocket: close 1006 (abnormal closure): unexpected EOF
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq is old - deleting
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: Traceback (most recent call last):
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 59, in <module>
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     main()
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 55, in main
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     check_remote(remote)
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 40, in check_remote
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     subprocess.check_call(
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     raise CalledProcessError(retcode, cmd)
Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: subprocess.CalledProcessError: Command '['lxc', 'delete', '--force', 'lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq']' ret

To work around the failure, we can restart the service and see if it
works again. If that does not work, delete the broken container and
reboot the host.

To stop it from happening again, Julian suggested adding
"TimeoutSec=1h" to cloud-worker-maintenance as a minimum. Ideally, the
delete call would have a 10 minute timeout, with a wrapper for
subprocess that handles the timeout.
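
A minimal sketch of such a wrapper, assuming the delete is issued with
subprocess.check_call as in the traceback above. The names
check_call_with_timeout and DELETE_TIMEOUT are hypothetical and not
taken from cleanup-lxd; only the timeout= argument of
subprocess.check_call is standard library behaviour.

```python
import subprocess

# Hypothetical constant: the 10 minute limit suggested in this report.
DELETE_TIMEOUT = 600  # seconds


def check_call_with_timeout(cmd, timeout=DELETE_TIMEOUT):
    """Run cmd like subprocess.check_call, but give up after `timeout` seconds.

    Returns True on success and False if the command timed out.
    A non-zero exit status still raises subprocess.CalledProcessError.
    """
    try:
        # On timeout, check_call kills the child process and then
        # raises TimeoutExpired.
        subprocess.check_call(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"{' '.join(cmd)} timed out after {timeout}s - skipping")
        return False
    return True
```

The cleanup loop could then call, for example,
check_call_with_timeout(['lxc', 'delete', '--force', name]) and move on
to the next container when it returns False. Note that this only kills
the local lxc client; whether the operation is also abandoned on the
LXD side is not something this sketch addresses.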

** Affects: auto-package-testing
     Importance: High
         Status: New

** Changed in: auto-package-testing
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of
Canonical's Ubuntu QA, which is subscribed to Auto Package Testing.
https://bugs.launchpad.net/bugs/1988080

Title:
  cloud-worker-maintenance can hang

Status in Auto Package Testing:
  New

Bug description:
  The cloud-worker-maintenance job appeared to be stuck with the
  following in journalctl:

  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3162016]: Error: Stopping the instance failed: websocket: close 1006 (abnormal closure): unexpected EOF
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq is old - deleting
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: Traceback (most recent call last):
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 59, in <module>
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     main()
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 55, in main
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     check_remote(remote)
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/home/ubuntu/autopkgtest-cloud/tools/cleanup-lxd", line 40, in check_remote
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     subprocess.check_call(
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:   File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]:     raise CalledProcessError(retcode, cmd)
  Aug 27 16:58:12 juju-4d1272-prod-proposed-migration-5 cloud-worker-maintenance[3161610]: subprocess.CalledProcessError: Command '['lxc', 'delete', '--force', 'lxd-armhf-10.44.124.124:autopkgtest-lxd-cyynbq']' ret

  To work around the failure, we can restart the service and see if it
  works again. If that does not work, delete the broken container and
  reboot the host.

  To stop it from happening again, Julian suggested adding
  "TimeoutSec=1h" to cloud-worker-maintenance as a minimum. Ideally, the
  delete call would have a 10 minute timeout, with a wrapper for
  subprocess that handles the timeout.

To manage notifications about this bug go to:
https://bugs.launchpad.net/auto-package-testing/+bug/1988080/+subscriptions


