
yahoo-eng-team team mailing list archive

[Bug 1892851] Re: Staged boot, to fix integration of systemd generators

 

@slyon

I have discussed multi-transaction boot with systemd upstream and with
cloud-init developers.

Overall, it is an expensive operation that may make the boot slower
and may have unintended consequences that will be harder to debug.

If more needs to add units during boot arise, imho we should do what
was done in netplan: simply start/add units to the current transaction
whenever possible, as that is quick.
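
A rough sketch of that idea in Python, assuming the caller already knows
the names of the freshly generated units. The helper name and the injected
`run` parameter are hypothetical, and this is not taken from the netplan
source; it only shows the general shape of enqueueing start jobs without
waiting for them.

```python
import subprocess


def start_units_nonblocking(units, run=subprocess.run):
    """Enqueue start jobs for `units` without waiting for them to finish.

    `systemctl start --no-block` only queues the jobs, so the caller (which
    may itself be part of the running boot transaction) does not block on
    them. `run` is injectable so the helper can be exercised without systemd.
    """
    units = list(units)
    if not units:
        # Nothing was generated, so there is nothing to enqueue.
        return []
    cmd = ["systemctl", "start", "--no-block", *units]
    run(cmd, check=True)
    return cmd
```

For the reproducer in this bug, the call would be something like
`start_units_nonblocking(["netplan-ovs-ovs0.service"])`.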

** Changed in: cloud-init
       Status: Confirmed => Invalid

** Changed in: netplan
       Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1892851

Title:
  Staged boot, to fix integration of systemd generators

Status in cloud-init:
  Invalid
Status in netplan:
  Fix Committed
Status in netplan.io package in Ubuntu:
  New

Bug description:
  [Intro]
  Cloud-init makes use of the "netplan" systemd generator, but calls "netplan generate" manually at runtime, while the initial systemd boot transaction is already executing. It does so instead of letting the generator run as intended at the systemd generator stage (e.g. via "systemctl daemon-reload"), because of restrictions on when it can fetch its data source (e.g. the netplan YAML config).

  [Problem]
  This leads to problems at first boot: the systemd unit dependencies are calculated after the generator stage but ahead of the boot transaction (e.g. via systemctl daemon-reload). The new service units and their dependencies, which are generated by manually calling systemd generators, are therefore ignored during the first-boot transaction. In subsequent boots (where the cloud-init data source, netplan YAML config and unit files are already in place), everything works as expected.

  It is a tricky situation, as cloud-init
   1/ does not yet have the full config to run the systemd generators (e.g. netplan YAML) before the systemd boot transaction. It first needs to fetch it via a DataSource, possibly via a network connection.
   2/ cannot execute the generators manually (e.g. "netplan generate") during the systemd boot transaction, because this way the newly generated service units and corresponding dependencies will be ignored.
   3/ cannot re-execute the systemd generators after the initial boot transaction, as it is already too late at this point and applications expect to have a readily configured network setup after cloud-final.target has been reached.

  [References]
  Such problems have been reported and discussed for WiFi on RaspberryPi (LP: #1870346) or Open vSwitch setups in MAAS (https://github.com/CanonicalLtd/netplan/pull/157), where some of the generated service units/dependencies (netplan-ovs-*.service or netplan-wpa-*.service, possibly SR-IOV units as well...) are not properly executed on first boot.

  [Suggestion]
  A possible solution I discussed with @xnox would be to re-engineer how cloud-init targets work a bit, by splitting up the cloud-init boot sequence into multiple stages, e.g.:

  * Start "Stage 0" systemd transaction: systemctl isolate cloud-stage0.target
    - execute the init local modules
    - set up basic networking (DHCP on eth0/ens3)
    - fetch data source & place netplan YAML in /etc/netplan/
  * Finish "Stage 0" transaction
  * Call systemctl daemon-reload
    - This will trigger all systemd generators (incl. netplan generate) and re-calculate all dependencies
  * Start "Stage 1" systemd transaction: systemctl isolate default.target
    - execute all the normal cloud-init modules and start all the normal services, e.g. via cloud-final.target
  * Finish "Stage 1" transaction
  * System is now fully booted

  The idea here is to split up the boot sequence into two (or more?)
  systemd transactions, so we can call "systemctl daemon-reload" in
  between (but not within a running systemd transaction) to re-run all
  the generators and re-calculate all the dependencies. This way all
  generators would be used in their intended way and should work as
  expected, even on first boot.
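
  The sequence above could be sketched roughly as follows. This is only an
  illustration of the proposal, not existing cloud-init code:
  "cloud-stage0.target" is the hypothetical target named in the list above,
  the function name is made up, and where such a driver would itself run
  from is left open.

```python
import subprocess

# One systemctl call per step. By default systemctl waits for the job it
# queued to finish, so each transaction ends before the next step starts.
STAGED_BOOT_STEPS = [
    ["systemctl", "isolate", "cloud-stage0.target"],  # Stage 0 transaction
    ["systemctl", "daemon-reload"],                   # re-run all generators
    ["systemctl", "isolate", "default.target"],       # Stage 1 transaction
]


def run_staged_boot(run=subprocess.run):
    """Run the steps in order, stopping at the first failing command."""
    executed = []
    for cmd in STAGED_BOOT_STEPS:
        run(cmd, check=True)  # raises CalledProcessError on failure
        executed.append(cmd)
    return executed
```

  The daemon-reload sits between the two isolate calls, so it never runs
  inside a transaction that is still in progress.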

  Doing that would also allow users to do interesting things with
  systemd via cloud-config, like changing the default.target from
  multi-user.target to emergency.target, adding / masking / removing
  units used in early boot, or "just write fstab" and let
  systemd-fstab-generator process it and mount things, etc.

  
  ### Config used to reproduce the problem in a LXD container:
  "systemctl status netplan-ovs-ovs0.service" will show that this unit has not been executed on first boot.

  config:
    user.network-config: |
      # cloud-config
      version: 2
      bridges:
        ovs0:
          addresses: [10.10.10.20/24]
          interfaces: [eth0.21]
          parameters:
            stp: false
          openvswitch: {}
      ethernets:
        eth0:
          addresses: [10.10.10.30/24]
      vlans:
        eth0.21:
          id: 21
          link: eth0
  description: My OVS debugging profile
  devices:
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
    root:
      path: /
      pool: default
      type: disk
  name: myovs

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1892851/+subscriptions

