[Bug 2061141] [NEW] Running out of `/tmp` space on cloud workers - post-mortem
Public bug reported:
During the week from 2024-04-08 to 2024-04-12, we experienced a lot of "No space
left on device" errors in many different jobs. Here is a description of various
aspects of that issue, which can act as a kind of post-mortem in the event we
face a similar situation again.
This is described from my own point of view, with my current understanding. I
don't pretend to understand every part of the problem, but I think I can give a
fairly broad overview.
This was mostly due to a combination of factors:
1. the fallout of the `time_t` transition and the `xz-utils` incident just
before the beta release of Noble led to huge autopkgtest queues that we needed
to drain as fast as possible.
2. we thus increased the number of workers running on our cloud units, after IS
increased our quota.
3. increasing the number of running jobs without increasing the size of the
main working directory can obviously lead to disaster.
4. this was amplified by really bad timing: the queue contained, in parallel,
a lot of tests for libreoffice, systemd, and llvm-toolchain-{15,16,17,18}.
All those packages require at least 1.5GB for their working directory on the
cloud worker.
5. this was amplified by the worker sometimes failing to clean its working
directory, leading to dangling folders that were only cleaned up after 30 days
on our units (see the sketch just after this list).
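To illustrate point 5: a periodic reaper along the following lines (a
hypothetical illustration, not the exact mechanism on our units) only removes
dangling folders a month after their last modification, which is far too slow
when space is running out within hours:
```
# Hypothetical sketch of a 30-day reaper: anything a worker fails to
# delete lingers in /tmp for a whole month before being removed.
find /tmp -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec sudo rm -rf {} +
```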
This combination of things resulted in a whole week of regularly "fixing" the
issue, only to discover 12 hours later that it was still there, then digging
further, taking new actions, and getting more and more depressed as users kept
coming to tell us there was more ENOSPC to report.
Now here is the list of actions that were taken to remediate each of
those points:
1. This was mostly down to the current context; besides queue cleaning, there
wasn't much to be done.
2. and 3. This was just a matter of reducing the number of jobs per worker:
the `n-workers` config was taken from 110 down to 90, for a 200GB /tmp (see
the back-of-the-envelope calculation after this list). The value is still
under observation.
4. As the ceph-based `/tmp` folder isn't very fast, making `du` and `rm` very
slow, the cleaning had to be precise and very effective. Here are the two
commands I ended up with to remove `libreoffice` directories older than a day:
```
# List the /tmp test directories whose test package is libreoffice
grep -H '^libreoffice ' /tmp/*/out/testpkg-version | cut -d'/' -f-3 > /tmp/tests
# Remove those older than a day, printing inode and space usage before and after
touch -d '1 day ago' /tmp/1.day.ago; df -i /tmp; df -h /tmp; for p in $(cat /tmp/tests); do if [ "$p" -ot /tmp/1.day.ago ]; then sudo rm -rf "$p"; fi; done; df -i /tmp; df -h /tmp
```
This was quite effective at bringing back a lot of free space and inodes
very quickly.
5. This was fixed by this MP:
https://code.launchpad.net/~hyask/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/463993
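Regarding the `n-workers` value from points 2 and 3, a back-of-the-envelope
calculation shows why 110 workers was too many for a 200GB `/tmp`, using the
1.5GB-per-job lower bound from point 4 (the real per-job footprint varies, so
this is only a rough sanity check):
```
# Worst case: every worker runs a heavy test needing at least 1.5GB in /tmp.
# 110 workers -> 165GB minimum, almost no headroom on a 200GB /tmp.
#  90 workers -> 135GB minimum, ~65GB of headroom for bigger jobs and leaks.
for n in 110 90; do echo "$n workers: $((n * 15 / 10))GB minimum of 200GB"; done
```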
All in all, pretty simple solutions, but the main problem really was
investigating the multiple causes of the issue, and their cascading effects,
like when the worker throws an exception while removing its working directory,
thus failing to delete it entirely, leaving the next runs with less and less
space.
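As a generic illustration of that last failure mode (a sketch only, not the
actual code from the MP above, and `cleanup_workdir` is a hypothetical name):
a cleanup that aborts on the first error leaves partial directories behind,
whereas a more defensive variant keeps going and reports what it couldn't
remove:
```
# Hypothetical sketch, not the actual worker code: tolerate a failed
# removal, retry once, and loudly report anything left dangling.
cleanup_workdir() {
    local dir="$1"
    if ! sudo rm -rf "$dir"; then
        sleep 1
        sudo rm -rf "$dir" || echo "WARNING: $dir left dangling in /tmp" >&2
    fi
}
cleanup_workdir /tmp/example-test-dir  # hypothetical path
```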
** Affects: auto-package-testing
Importance: Undecided
Status: New
https://bugs.launchpad.net/bugs/2061141