yahoo-eng-team team mailing list archive

[Bug 1750780] Re: Race with local file systems can make open-vm-tools fail to start

 

Installed another Xenial and Bionic in VMware to take a deeper look.
- Xenial (with backported open-vm-tools): affected
- Bionic (with the interim fix reverted): no hit in several retries, explanation below

Systemd fixed it (via our assumed implicit dependency).
In Bionic, PrivateTmp gives the unit a dependency on systemd-tmpfiles-setup.service (seen in systemd-analyze; there might be more, but not on the critical path).
This is configured by default to include /var/tmp in /usr/lib/tmpfiles.d/tmp.conf.

Regarding your thoughts about changing the cloud-init ordering later on:
that won't help you, as the dependency is there either way (implicit or
explicit doesn't matter).

For the Xenial case, where I reliably hit the issue, I cut things short instead of stracing.
A service with the following content exposes exactly the same error:
[Unit]
Description=foo
DefaultDependencies=no

[Service]
PrivateTmp=yes
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

So back on Xenial it is PrivateTmp + starting too early that breaks it.
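
As an illustration only (this variant is a sketch, not something tested in this comment), the same test unit with the After=local-fs.target ordering from the proposed fix would look like this. The description text is made up; everything else is taken from the unit above and from the bug description:
[Unit]
# Hypothetical variant of the test unit: the only functional change is the After= line
Description=foo with local-fs ordering
DefaultDependencies=no
After=local-fs.target

[Service]
PrivateTmp=yes
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

The assumption is that ordering the unit after local-fs.target delays the PrivateTmp setup until /var/tmp is writable, which is what the "Xenial with fix" chain below suggests.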

Xenial vs Bionic critical chain according to "systemd-analyze
critical-chain open-vm-tools.service":

Xenial with fix:
open-vm-tools.service @3.482s
└─local-fs.target @3.460s
  └─local-fs-pre.target @3.460s
    └─systemd-remount-fs.service @3.442s +9ms
      └─system.slice @220ms
        └─-.slice @204ms

Xenial without fix:
└─run-vmblock\x2dfuse.mount @6.076s +390ms
  └─sys-fs-fuse-connections.mount @5.510s +375ms
    └─systemd-modules-load.service @1.996s +75ms
      └─system.slice @1.984s
        └─-.slice @1.966s

Bionic:
open-vm-tools.service @3.566s
└─systemd-tmpfiles-setup.service @3.421s +100ms
  └─systemd-journal-flush.service @3.054s +342ms
    └─systemd-journald.service @825ms +2.219s
      └─syslog.socket @808ms
        └─system.slice @621ms
          └─-.slice @613ms

To summarize, we can:
- revert the fix for Bionic (or later); just make it a sync when convenient down the road, as it doesn't hurt for now (it is almost the same as the implicit dependency)
- add a Xenial systemd bug task (probably too complex to fix as -upstream)
- until said systemd bug is fixed, a backport of open-vm-tools needs this fix


** Also affects: systemd (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: open-vm-tools (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Also affects: systemd (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: open-vm-tools (Ubuntu Xenial)
       Status: New => Triaged

** Changed in: systemd (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1750780

Title:
  Race with local file systems can make open-vm-tools fail to start

Status in cloud-init:
  Invalid
Status in open-vm-tools package in Ubuntu:
  Fix Released
Status in systemd package in Ubuntu:
  Fix Released
Status in open-vm-tools source package in Xenial:
  Triaged
Status in systemd source package in Xenial:
  New
Status in open-vm-tools package in Debian:
  Incomplete

Bug description:
  Since the change in [1], open-vm-tools.service starts very (very) early.
  This is not so much due to
  Before=cloud-init-local.service
  but much more due to
  DefaultDependencies=no

  That can trigger an issue that looks like:
  root@ubuntuguest:~# systemctl status -l open-vm-tools.service
  ● open-vm-tools.service - Service for virtual machines hosted on VMware
     Loaded: loaded (/lib/systemd/system/open-vm-tools.service; enabled; vendor preset: enabled)
     Active: failed (Result: resources)

  
  As it is right now, open-vm-tools can race with the other early-starting units and then fail.
  In detail, one can find a message like:
    open-vm-tools.service: Failed to run 'start' task: Read-only file system

  This is due to PrivateTmp=yes, which is also set and needs a writable
  /var/tmp [2].

  To ensure this works, PrivateTmp would have to be removed (not good) or some After= dependencies added that make this work reliably.
  I added
  After=local-fs.target
  which made it work for me in 3/3 tests.

  I'd like to have an ack from the cloud-init team that this does not totally kill the originally intended Before=cloud-init-local.service.
  I think it does not, as local-fs can complete before cloud-init-local, then open-vm-tools can initialize, and finally cloud-init-local can pick up the data.

  To summarize:
  # cloud-init-local #
  DefaultDependencies=no
  Wants=network-pre.target
  After=systemd-remount-fs.service
  Before=NetworkManager.service
  Before=network-pre.target
  Before=shutdown.target
  Before=sysinit.target
  Conflicts=shutdown.target
  RequiresMountsFor=/var/lib/cloud

  # open-vm-tools #
  DefaultDependencies=no
  Before=cloud-init-local.service

  Proposed is to add to the latter:
  After=local-fs.target
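
  For local testing, the same ordering could be tried without editing the packaged unit by using a systemd drop-in. This is only a sketch: the proposal itself changes the shipped unit file, and the drop-in file name below is hypothetical.
  # /etc/systemd/system/open-vm-tools.service.d/after-local-fs.conf (hypothetical file name)
  [Unit]
  After=local-fs.target

  After adding such a file, "systemctl daemon-reload" makes systemd pick it up; the ordering then applies on the next boot.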

  [1]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=859677
  [2]: https://github.com/systemd/systemd/issues/5610

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1750780/+subscriptions

