← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1998655] [NEW] Cloud-init runs before network interface is ready on RHEL 8.7

 

Public bug reported:

We're running RHEL 8.7 VMs on Apache CloudStack.  The cloud-init service
waits for Network Manager to start, but it runs before Network Manager
is done.  Because the network isn't ready yet, the CloudStack data
source is unable to determine the virtual router address.  This causes
cloud-init to fall back to using DataSourceNone.  That leaves the VM
without the proper configuration.

I see one bug, and one potential bug here.  The bug is that cloud-init
is starting too early in the boot process.  The potential bug is that
the CloudStack data source doesn't have any retry logic when it looks
for the virtual router.

To fix the timing but, I recommend adding `After=NetworkManager-wait-
online.service` to the cloud-init.service file.  That will make it wait
until after the network interface is ready.  Note that it's already
waiting for systemd-networkd-wait-online.service, but that service
doesn't exist in RHEL 8.7, so systemd ignores that line.

You can see the timing issue in this excerpt from /var/log/messages:

Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.5790] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6212] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6213] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6215] manager: NetworkManager state is now CONNECTED_SITE
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6217] device (ens3): Activation: successful, device activated.
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6220] manager: NetworkManager state is now CONNECTED_GLOBAL
Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6222] manager: startup complete
Dec  2 18:27:04 mark-test-fs systemd[1]: Started Network Manager Wait Online.
Dec  2 18:27:04 mark-test-fs systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,027 - util.py[WARNING]: Failed to fetch password from virtual router 10.11.8.0
Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,043 - stages.py[INFO]: Loaded datasource DataSourceCloudStack - DataSourceCloudStack
Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,093 - stages.py[INFO]: Applying network configuration from fallback bringup=True: {'ethernets': {'ens3': {'dhcp4': True, 'set-name': 'ens3', 'match': {'macaddress': '1e:00:29:00:02:b4'}}}, 'version': 2}
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1480] agent-manager: agent[f74dfcab0129a270,:1.13/nmcli-connect/0]: agent registered
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1487] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1489] manager: NetworkManager state is now DISCONNECTING
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1494] device (ens3): disconnecting for new activation request.
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1494] audit: op="connection-activate" uuid="21d47e65-8523-1a06-af22-6f121086f085" name="System ens3" pid=975 uid=0 result="success"
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2131] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): canceled DHCP transaction
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): state changed no lease
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2770] manager: NetworkManager state is now DISCONNECTED
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2771] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2777] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2778] manager: NetworkManager state is now CONNECTING
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2779] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2790] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2791] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2855] dhcp4 (ens3): state changed new lease, address=10.11.8.179
Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2858] policy: set 'System ens3' (ens3) as default for IPv4 routing and DNS

** Affects: cloud-init
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1998655

Title:
  Cloud-init runs before network interface is ready on RHEL 8.7

Status in cloud-init:
  New

Bug description:
  We're running RHEL 8.7 VMs on Apache CloudStack.  The cloud-init
  service waits for Network Manager to start, but it runs before Network
  Manager is done.  Because the network isn't ready yet, the CloudStack
  data source is unable to determine the virtual router address.  This
  causes cloud-init to fall back to using DataSourceNone.  That leaves
  the VM without the proper configuration.

  I see one bug, and one potential bug here.  The bug is that cloud-init
  is starting too early in the boot process.  The potential bug is that
  the CloudStack data source doesn't have any retry logic when it looks
  for the virtual router.

  To fix the timing but, I recommend adding `After=NetworkManager-wait-
  online.service` to the cloud-init.service file.  That will make it
  wait until after the network interface is ready.  Note that it's
  already waiting for systemd-networkd-wait-online.service, but that
  service doesn't exist in RHEL 8.7, so systemd ignores that line.

  You can see the timing issue in this excerpt from /var/log/messages:

  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.5790] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6212] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6213] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6215] manager: NetworkManager state is now CONNECTED_SITE
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6217] device (ens3): Activation: successful, device activated.
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6220] manager: NetworkManager state is now CONNECTED_GLOBAL
  Dec  2 18:27:04 mark-test-fs NetworkManager[723]: <info>  [1670005624.6222] manager: startup complete
  Dec  2 18:27:04 mark-test-fs systemd[1]: Started Network Manager Wait Online.
  Dec  2 18:27:04 mark-test-fs systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
  Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,027 - util.py[WARNING]: Failed to fetch password from virtual router 10.11.8.0
  Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,043 - stages.py[INFO]: Loaded datasource DataSourceCloudStack - DataSourceCloudStack
  Dec  2 18:27:05 mark-test-fs journal[938]: [CLOUDINIT]2022-12-02 18:27:05,093 - stages.py[INFO]: Applying network configuration from fallback bringup=True: {'ethernets': {'ens3': {'dhcp4': True, 'set-name': 'ens3', 'match': {'macaddress': '1e:00:29:00:02:b4'}}}, 'version': 2}
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1480] agent-manager: agent[f74dfcab0129a270,:1.13/nmcli-connect/0]: agent registered
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1487] device (ens3): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1489] manager: NetworkManager state is now DISCONNECTING
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1494] device (ens3): disconnecting for new activation request.
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.1494] audit: op="connection-activate" uuid="21d47e65-8523-1a06-af22-6f121086f085" name="System ens3" pid=975 uid=0 result="success"
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2131] device (ens3): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): canceled DHCP transaction
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2247] dhcp4 (ens3): state changed no lease
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2770] manager: NetworkManager state is now DISCONNECTED
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2771] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2777] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2778] manager: NetworkManager state is now CONNECTING
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2779] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2790] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2791] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2855] dhcp4 (ens3): state changed new lease, address=10.11.8.179
  Dec  2 18:27:05 mark-test-fs NetworkManager[723]: <info>  [1670005625.2858] policy: set 'System ens3' (ens3) as default for IPv4 routing and DNS

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1998655/+subscriptions



Follow ups