← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1806701] [NEW] cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached

 

Public bug reported:

We have found in our test for SLES15 with cloud-init installed, if we
attach a ISO file with the VM before VM is boot, it often takes more
than 10 minutes to start the SLES OS. Sometimes it failed to start the
SLES OS at all.

We've root caused it is due to the  "is_cdrom_ovf()" func of "tools/ds-
identify".

In this function, there is the following logic to detect if an ISO contains certain string:
>    local idstr="http://schemas.dmtf.org/ovf/environment/1";
>    grep --quiet --ignore-case "$idstr" "${PATH_ROOT}$dev"
ref: https://git.launchpad.net/cloud-init/tree/tools/ds-identify

It is trying to grep the who ISO file for a certain string, which causes intense IO pressure for the system.
What is worse is that sometimes the ISO file is large (e.g. >5GB for installer DVD) and it is mounted over NFS. The "grep" process often consume 99% CPU and seems hang. Then the systemd starts more and more "grep" process which smoke the CPU and consumes all the IO bandwidth for the ISO file. Then the system may hang for a long time and sometimes failed to start.

To fix this issue, I suggest that we should not grep for the entire ISO
file. Rather then we should just check if the file/dir exists with
os.path.exists().


-------------------------debug log snip------------------------
pek2-gosv-16-dhcp180:~ # ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 13:32 ?        00:00:04 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
…
root       474     1  0 13:34 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root       482   474  2 13:34 ?        00:00:15 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root      1020     1  0 13:35 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root      1039  1020  1 13:35 ?        00:00:07 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
polkitd   1049     1  0 13:37 ?        00:00:00 /usr/lib/polkit-1/polkitd --no-debug
root      1051     1  0 13:37 ?        00:00:00 /usr/sbin/wickedd --systemd --foreground
root      1052     1  0 13:37 ?        00:00:00 /usr/lib/systemd/systemd-logind
root      1054     1  0 13:37 ?        00:00:00 /usr/sbin/wickedd-nanny --systemd --foreground
root      1073     1  0 13:37 ?        00:00:00 /usr/bin/vmtoolsd
root      1097     1  0 13:37 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root      1110  1097  1 13:37 ?        00:00:04 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root      1304     1  0 13:38 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root      1312  1304  1 13:38 ?        00:00:03 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root      1537     1  0 13:40 ?        00:00:00 /usr/bin/plymouth --wait
root      1613     1  0 13:40 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root      1645  1613  0 13:40 ?        00:00:02 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
…


Grep use nearly 100% cpu, system very slow.

top - 13:46:37 up 26 min,  2 users,  load average: 14.14, 15.03, 10.57
Tasks: 225 total,   6 running, 219 sleeping,   0 stopped,   0 zombie
%Cpu(s): 40.1 us, 49.3 sy,  0.0 ni,  0.0 id,  1.4 wa,  0.0 hi,  9.1 si,  0.0 st
KiB Mem :  1000916 total,    64600 free,   355880 used,   580436 buff/cache
KiB Swap:  1288168 total,  1285600 free,     2568 used.   492688 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
4427 root      20   0   40100   3940   3084 R 99.90 0.394   0:27.41 top
1016 root      20   0  197796   4852   3400 R 99.90 0.485   1:26.44 vmtoolsd
1723 root      20   0    7256   1860   1556 D 99.90 0.186   0:28.44 grep
  484 root      20   0    7256   1684   1396 D 99.90 0.168   1:51.22 grep
1278 root      20   0    7256   1856   1556 D 99.90 0.185   0:38.44 grep
1398 root      20   0    7256   1860   1556 R 99.90 0.186   0:28.53 grep
1061 root      20   0    7256   1856   1556 D 99.90 0.185   0:56.62 grep
-------------------------debug log snip------------------------

** Affects: cloud-init
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1806701

Title:
  cloud-init may hang OS boot process due to grep for the entire ISO
  file when it is attached

Status in cloud-init:
  New

Bug description:
  We have found in our test for SLES15 with cloud-init installed, if we
  attach a ISO file with the VM before VM is boot, it often takes more
  than 10 minutes to start the SLES OS. Sometimes it failed to start the
  SLES OS at all.

  We've root caused it is due to the  "is_cdrom_ovf()" func of "tools
  /ds-identify".

  In this function, there is the following logic to detect if an ISO contains certain string:
  >    local idstr="http://schemas.dmtf.org/ovf/environment/1";
  >    grep --quiet --ignore-case "$idstr" "${PATH_ROOT}$dev"
  ref: https://git.launchpad.net/cloud-init/tree/tools/ds-identify

  It is trying to grep the who ISO file for a certain string, which causes intense IO pressure for the system.
  What is worse is that sometimes the ISO file is large (e.g. >5GB for installer DVD) and it is mounted over NFS. The "grep" process often consume 99% CPU and seems hang. Then the systemd starts more and more "grep" process which smoke the CPU and consumes all the IO bandwidth for the ISO file. Then the system may hang for a long time and sometimes failed to start.

  To fix this issue, I suggest that we should not grep for the entire
  ISO file. Rather then we should just check if the file/dir exists with
  os.path.exists().

  
  -------------------------debug log snip------------------------
  pek2-gosv-16-dhcp180:~ # ps -ef
  UID        PID  PPID  C STIME TTY          TIME CMD
  root         1     0  0 13:32 ?        00:00:04 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
  …
  root       474     1  0 13:34 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
  root       482   474  2 13:34 ?        00:00:15 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
  root      1020     1  0 13:35 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
  root      1039  1020  1 13:35 ?        00:00:07 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
  polkitd   1049     1  0 13:37 ?        00:00:00 /usr/lib/polkit-1/polkitd --no-debug
  root      1051     1  0 13:37 ?        00:00:00 /usr/sbin/wickedd --systemd --foreground
  root      1052     1  0 13:37 ?        00:00:00 /usr/lib/systemd/systemd-logind
  root      1054     1  0 13:37 ?        00:00:00 /usr/sbin/wickedd-nanny --systemd --foreground
  root      1073     1  0 13:37 ?        00:00:00 /usr/bin/vmtoolsd
  root      1097     1  0 13:37 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
  root      1110  1097  1 13:37 ?        00:00:04 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
  root      1304     1  0 13:38 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
  root      1312  1304  1 13:38 ?        00:00:03 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
  root      1537     1  0 13:40 ?        00:00:00 /usr/bin/plymouth --wait
  root      1613     1  0 13:40 ?        00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
  root      1645  1613  0 13:40 ?        00:00:02 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
  …

  
  Grep use nearly 100% cpu, system very slow.

  top - 13:46:37 up 26 min,  2 users,  load average: 14.14, 15.03, 10.57
  Tasks: 225 total,   6 running, 219 sleeping,   0 stopped,   0 zombie
  %Cpu(s): 40.1 us, 49.3 sy,  0.0 ni,  0.0 id,  1.4 wa,  0.0 hi,  9.1 si,  0.0 st
  KiB Mem :  1000916 total,    64600 free,   355880 used,   580436 buff/cache
  KiB Swap:  1288168 total,  1285600 free,     2568 used.   492688 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  4427 root      20   0   40100   3940   3084 R 99.90 0.394   0:27.41 top
  1016 root      20   0  197796   4852   3400 R 99.90 0.485   1:26.44 vmtoolsd
  1723 root      20   0    7256   1860   1556 D 99.90 0.186   0:28.44 grep
    484 root      20   0    7256   1684   1396 D 99.90 0.168   1:51.22 grep
  1278 root      20   0    7256   1856   1556 D 99.90 0.185   0:38.44 grep
  1398 root      20   0    7256   1860   1556 R 99.90 0.186   0:28.53 grep
  1061 root      20   0    7256   1856   1556 D 99.90 0.185   0:56.62 grep
  -------------------------debug log snip------------------------

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1806701/+subscriptions


Follow ups