[Bug 2061141] [NEW] Running out of `/tmp` space on cloud workers - post-mortem
Public bug reported:
During the week from 2024-04-08 to 2024-04-12, we experienced a lot of "No space
left on device" errors in many different jobs. Here is a description of various
aspects of that issue, which can act as a kind of post-mortem in the event we
face a similar situation again.
This is described from my own point of view, with my current understanding. I
don't pretend to understand every part of the problem, but I think I can give a
fairly broad overview.
This was mostly due to a combination of factors:
1. the fallout of the `time_t` transition and the `xz-utils` incident just
before the beta release of Noble led to huge autopkgtest queues that we needed
to drain as fast as possible.
2. we thus increased the number of workers running on our cloud units, after IS
increased our quota.
3. increasing the number of running jobs without increasing the size of the
main working directory can obviously lead to disaster.
4. this was amplified by really bad timing: the queue contained, in parallel,
a lot of tests for libreoffice, systemd, and llvm-toolchain-{15,16,17,18}.
All those packages require at least 1.5GB for their working directory on the
cloud worker.
5. this was amplified by the worker sometimes failing to clean its working
directory, leading to dangling folders that were only cleaned up after 30 days
on our units (see the sketch just after this list).
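To illustrate point 5: a periodic reaper along the following lines (a
hypothetical illustration, not the exact mechanism on our units) only removes
dangling folders a month after their last modification, which is far too slow
when space is running out within hours:
```
# Hypothetical sketch of a 30-day reaper: anything a worker fails to
# delete lingers in /tmp for a whole month before being removed.
find /tmp -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec sudo rm -rf {} +
```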
This combination of things resulted in a whole week of regularly "fixing" the
issue, only to discover 12 hours later that it was still there, then digging
further, taking new actions, and getting more and more depressed as users kept
coming to tell us there was more ENOSPC to report.
Now here is the list of actions that were taken to remediate each of
those points:
1. This was mostly down to the current context; besides queue cleaning, there
wasn't much to be done.
2. and 3. This was just a matter of reducing the number of jobs per worker:
the `n-workers` config was taken from 110 down to 90, for a 200GB /tmp (see
the back-of-the-envelope calculation after this list). The value is still
under observation.
4. As the ceph-based `/tmp` folder isn't very fast, making `du` and `rm` very
slow, the cleaning had to be precise and very effective. Here are the two
commands I ended up with to remove `libreoffice` directories older than a day:
```
# List the /tmp test directories whose test package is libreoffice
grep -H '^libreoffice ' /tmp/*/out/testpkg-version | cut -d'/' -f-3 > /tmp/tests
# Remove those older than a day, printing inode and space usage before and after
touch -d '1 day ago' /tmp/1.day.ago; df -i /tmp; df -h /tmp; for p in $(cat /tmp/tests); do if [ "$p" -ot /tmp/1.day.ago ]; then sudo rm -rf "$p"; fi; done; df -i /tmp; df -h /tmp
```
This was quite effective at bringing back a lot of free space and inodes
very quickly.
5. This was fixed by this MP:
https://code.launchpad.net/~hyask/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/463993
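Regarding the `n-workers` value from points 2 and 3, a back-of-the-envelope
calculation shows why 110 workers was too many for a 200GB `/tmp`, using the
1.5GB-per-job lower bound from point 4 (the real per-job footprint varies, so
this is only a rough sanity check):
```
# Worst case: every worker runs a heavy test needing at least 1.5GB in /tmp.
# 110 workers -> 165GB minimum, almost no headroom on a 200GB /tmp.
#  90 workers -> 135GB minimum, ~65GB of headroom for bigger jobs and leaks.
for n in 110 90; do echo "$n workers: $((n * 15 / 10))GB minimum of 200GB"; done
```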
All in all, pretty simple solutions, but the main problem really was
investigating the multiple causes of the issue, and their cascading effects,
like when the worker throws an exception while removing its working directory,
thus failing to delete it entirely, leaving the next runs with less and less
space.
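As a generic illustration of that last failure mode (a sketch only, not the
actual code from the MP above, and `cleanup_workdir` is a hypothetical name):
a cleanup that aborts on the first error leaves partial directories behind,
whereas a more defensive variant keeps going and reports what it couldn't
remove:
```
# Hypothetical sketch, not the actual worker code: tolerate a failed
# removal, retry once, and loudly report anything left dangling.
cleanup_workdir() {
    local dir="$1"
    if ! sudo rm -rf "$dir"; then
        sleep 1
        sudo rm -rf "$dir" || echo "WARNING: $dir left dangling in /tmp" >&2
    fi
}
cleanup_workdir /tmp/example-test-dir  # hypothetical path
```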
** Affects: auto-package-testing
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2061141