
[Bug 2061141] [NEW] Running out of `/tmp` space on cloud workers - post-mortem

 

Public bug reported:

During the week from 2024-04-08 to 2024-04-12, we experienced a lot of "No space
left on device" errors across many different jobs. Here is a description of the
various aspects of that issue, which can serve as a kind of post-mortem in case
we face a similar situation again.
This is written from my own point of view, with my current understanding. I
don't pretend to understand every part of the problem, but I think I can give a
fairly broad overview.

This was mostly due to multiple things:
 1. the fallout from `time_t` and `xz-utils` just before the beta release of
    Noble led to huge autopkgtest queues that we needed to consume as fast as
    possible.
 2. we thus increased the number of workers running on our cloud units, after IS
    increased our quota.
 3. increasing the number of running jobs without increasing the size of the
    main working directory can obviously lead to disaster (see the rough
    arithmetic after this list).
 4. this was amplified by really bad timing: the queue contained, in parallel, a
    lot of tests for libreoffice, systemd, and llvm-toolchain-{15,16,17,18}, all
    of which require at least 1.5GB for their working directory on the
    cloud worker.
 5. this was amplified by the worker sometimes failing to clean its working
    directory, leading to dangling folders that are only cleaned up after 30
    days on our units.
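
To put rough numbers on points 2 to 4, here is a back-of-envelope sketch, not a
measurement, using the figures quoted in the remediation section below:
```
# Back-of-envelope only, based on figures quoted in the remediation section
# (200GB /tmp, n-workers lowered from 110 to 90, >=1.5GB per heavy working dir).
echo "scale=2; 200 / 110" | bc   # ~1.81GB of /tmp per job at 110 workers
echo "scale=2; 200 / 90"  | bc   # ~2.22GB of /tmp per job at 90 workers
# With libreoffice, systemd, and llvm-toolchain each needing at least 1.5GB,
# plus dangling directories left by failed cleanups, the 110-worker margin is
# essentially gone.
```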

This combination of things resulted in a whole week of regularly "fixing" the
issue, only to discover 12 hours later that it was still there, digging further,
taking new actions, and getting more and more depressed as users kept coming to
tell us there were more ENOSPC failures to report.

Now here is the list of actions that were taken to remediate each of
those points:

 1. This was mostly down to the circumstances, and besides cleaning the queues,
    there isn't much to be done.
 2. and 3. This was just a matter of reducing the number of jobs per worker: the
    `n-workers` config was lowered from 110 to 90, for a 200GB /tmp. The value
    is still under observation.
 4. As the ceph-based `/tmp` folder isn't very fast, making `du` and `rm` very
    slow, the cleaning had to be precise and very effective. Here are the
    commands I ended up with to remove `libreoffice` directories older than a
    day (a generalized version is sketched after this list):
```
grep -H '^libreoffice ' /tmp/*/out/testpkg-version | cut -d'/' -f-3 > /tmp/tests
touch -d '1 day ago' /tmp/1.day.ago; df -i /tmp; df -h /tmp
for p in $(cat /tmp/tests); do if [ "$p" -ot /tmp/1.day.ago ]; then sudo rm -rf "$p"; fi; done; df -i /tmp; df -h /tmp
```
    This was quite effective at bringing back a lot of free space and inodes
    very quickly.
 5. This was fixed by this MP:
    https://code.launchpad.net/~hyask/autopkgtest-cloud/+git/autopkgtest-cloud/+merge/463993
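
For future reference, here is a slightly generalized version of the cleanup
commands from point 4, parameterized by package name and age. The
`clean_tmp_for` helper and its interface are hypothetical (nothing in
autopkgtest-cloud provides it); it only restates the commands above in a
reusable form:
```
# Hypothetical helper, not part of autopkgtest-cloud: same logic as the
# commands above, parameterized by package name and age.
clean_tmp_for() {
    local pkg="$1" age="${2:-1 day ago}"
    local ref=/tmp/cleanup.ref
    touch -d "$age" "$ref"
    df -i /tmp; df -h /tmp
    # Find working directories whose testpkg-version reports this package...
    grep -H "^$pkg " /tmp/*/out/testpkg-version | cut -d'/' -f-3 | sort -u |
    while read -r dir; do
        # ...and remove the ones older than the reference timestamp.
        if [ "$dir" -ot "$ref" ]; then sudo rm -rf "$dir"; fi
    done
    df -i /tmp; df -h /tmp
}
# Example: clean_tmp_for libreoffice; clean_tmp_for systemd '12 hours ago'
```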

All in all, these were pretty simple solutions, but the main difficulty was
investigating the multiple causes of the issue and their cascading effects,
like the worker throwing an error while removing its working directory, thus
failing to delete it entirely and leaving the next runs with less and less
space.
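
To illustrate that last failure mode, here is a minimal sketch (purely
hypothetical, not the code from the MP above) of a best-effort removal that
keeps going instead of bailing out on the first error, so less residue is left
behind for the next runs:
```
# Hypothetical sketch, not the actual worker code: try to remove a working
# directory, retrying once, and flag it for the periodic cleanup if it fails.
remove_workdir() {
    local dir="$1"
    for attempt in 1 2; do
        rm -rf "$dir" 2>/dev/null && return 0
        sleep 5   # transient errors on the slow ceph-backed /tmp may clear
    done
    echo "WARNING: could not fully remove $dir, leaving it to the 30-day cleanup" >&2
    return 1
}
```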

** Affects: auto-package-testing
     Importance: Undecided
         Status: New
