yahoo-eng-team team mailing list archive
Message #83825
[Bug 1892851] Re: Staged boot, to fix integration of systemd generators
We found a fix for this problem in netplan itself.
The overall idea of a staged-boot environment should probably still be considered for a future release of cloud-init.
https://github.com/CanonicalLtd/netplan/pull/162
** Also affects: netplan
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1892851
Title:
Staged boot, to fix integration of systemd generators
Status in cloud-init:
Confirmed
Status in netplan:
New
Bug description:
[Intro]
Cloud-init makes use of the "netplan" systemd generator, but calls "netplan generate" manually at runtime, while the initial systemd boot transaction is already executing, instead of letting it run as intended at the systemd generator stage (e.g. via "systemctl daemon-reload"). It does so because of the restrictions it has regarding fetching its data source (e.g. the netplan YAML config).
[Problem]
This leads to problems at first boot, as the systemd unit dependencies are calculated after the generator stage but ahead of the boot transaction (e.g. via systemctl daemon-reload). The new service units and their dependencies, which are generated by manually calling systemd generators, are therefore ignored during the first-boot transaction. In subsequent boots (where the cloud-init data source, netplan YAML config and unit files are already in place), everything works as expected.
It is a tricky situation, as cloud-init:
1/ does not yet have the full config needed to run the systemd generators (e.g. netplan YAML) before the systemd boot transaction. It first needs to fetch it via a DataSource, possibly via a network connection.
2/ cannot execute the generators manually (e.g. "netplan generate") during the systemd boot transaction, because the newly generated service units and corresponding dependencies will then be ignored.
3/ cannot re-execute the systemd generators after the initial boot transaction, as it is already too late at that point: applications expect a fully configured network setup once cloud-final.target has been reached.
[References]
Such problems have been reported and discussed for WiFi on Raspberry Pi (LP: #1870346) and for Open vSwitch setups in MAAS (https://github.com/CanonicalLtd/netplan/pull/157), where some of the generated service units/dependencies (netplan-ovs-*.service or netplan-wpa-*.service, possibly SR-IOV units as well...) are not properly executed on first boot.
[Suggestion]
A possible solution I discussed with @xnox would be to re-engineer how cloud-init targets work a bit, by splitting up the cloud-init boot sequence into multiple stages, e.g.:
* Start "Stage 0" systemd transaction: systemctl isolate cloud-stage0.target
  - execute the init local modules
  - setup basic networking (DHCP on eth0/ens3)
  - fetch data source & place netplan YAML in /etc/netplan/
* Finish "Stage 0" transaction
* Call systemctl daemon-reload
  - This will trigger all systemd generators (incl. netplan generate) and re-calculate all dependencies
* Start "Stage 1" systemd transaction: systemctl isolate default.target
  - execute all the normal cloud-init modules and start all the normal services, e.g. via cloud-final.target
* Finish "Stage 1" transaction
* System is now fully booted
The idea here is to split up the boot sequence into two (or more?)
systemd transactions, so we can call "systemctl daemon-reload" in
between (but not within a running systemd transaction) to re-run all
the generators and re-calculate all the dependencies. This way all
generators would be used in their intended way and should work as
expected, even on first boot.
Doing that would also allow users to do interesting things with
systemd via cloud-config, like changing the default.target from
multi-user.target to emergency.target, adding / masking / removing
units used in early boot, or "just writing fstab" and letting
systemd-fstab-generator process it and mount things, etc.
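As a rough illustration of the ordering proposed above, the three systemctl calls could be sequenced as in the sketch below. This is only a sketch under assumptions: cloud-stage0.target is the hypothetical target name from the proposal and does not exist in cloud-init today, the function names are made up for illustration, and the question of which process would drive the sequence (and survive "systemctl isolate") is left open.

```python
import subprocess

# Hypothetical target from the proposal; cloud-init ships no such unit today.
STAGE0_TARGET = "cloud-stage0.target"
STAGE1_TARGET = "default.target"


def staged_boot_commands(stage0=STAGE0_TARGET, stage1=STAGE1_TARGET):
    """Return the systemctl calls in the order the proposal lists them."""
    return [
        # Stage 0: init-local modules, basic networking, fetch data source.
        ["systemctl", "isolate", stage0],
        # Between transactions: re-run all generators (incl. netplan) and
        # re-calculate unit dependencies.
        ["systemctl", "daemon-reload"],
        # Stage 1: normal cloud-init modules and all regular services.
        ["systemctl", "isolate", stage1],
    ]


def run_staged_boot(dry_run=True):
    """Return the command sequence; execute it only when dry_run is False."""
    commands = staged_boot_commands()
    if not dry_run:
        for cmd in commands:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)
    return commands
```

The point of the sketch is only the ordering: daemon-reload sits between the two transactions, never inside one.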
### Config used to reproduce the problem in an LXD container:
"systemctl status netplan-ovs-ovs0.service" will show that this unit has not been executed on first boot.
config:
  user.network-config: |
    # cloud-config
    version: 2
    bridges:
      ovs0:
        addresses: [10.10.10.20/24]
        interfaces: [eth0.21]
        parameters:
          stp: false
        openvswitch: {}
    ethernets:
      eth0:
        addresses: [10.10.10.30/24]
    vlans:
      eth0.21:
        id: 21
        link: eth0
description: My OVS debugging profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: myovs
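For anyone wanting to try the profile above, the reproduction could be scripted roughly as follows. This is a sketch, not the exact steps used for the report: the file name myovs.yaml, the container name ovs-test and the image ubuntu:20.04 are assumptions (the report names none of them), and the profile is assumed to be saved in that file.

```python
import subprocess

PROFILE = "myovs"                  # profile name from the config above
PROFILE_FILE = "myovs.yaml"        # hypothetical file holding that config
CONTAINER = "ovs-test"             # hypothetical container name
IMAGE = "ubuntu:20.04"             # assumed image; the report does not name one
UNIT = "netplan-ovs-ovs0.service"  # unit expected to be skipped on first boot


def repro_steps():
    """Return (argv, stdin_path) pairs, one per reproduction step."""
    return [
        (["lxc", "profile", "create", PROFILE], None),
        # "lxc profile edit" reads the profile YAML from stdin.
        (["lxc", "profile", "edit", PROFILE], PROFILE_FILE),
        (["lxc", "launch", IMAGE, CONTAINER, "--profile", PROFILE], None),
        # Wait for cloud-init to finish before inspecting the unit.
        (["lxc", "exec", CONTAINER, "--", "cloud-init", "status", "--wait"], None),
        (["lxc", "exec", CONTAINER, "--", "systemctl", "status", UNIT], None),
    ]


def run(steps):
    """Execute each step; failures are not raised, because
    "systemctl status" exits non-zero for an inactive unit."""
    for argv, stdin_path in steps:
        if stdin_path is None:
            subprocess.run(argv, check=False)
        else:
            with open(stdin_path) as fh:
                subprocess.run(argv, stdin=fh, check=False)
```

Calling run(repro_steps()) on a host with LXD installed should end with the status output for netplan-ovs-ovs0.service on the container's first boot.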
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1892851/+subscriptions