debcrafters-packages team mailing list archive
Message #01562
[Bug 2098515] Re: IPv6-only (single stack) instances configuring network over dhcp in initramfs will take a long time to boot due to loop in dhcpcd -4
Thanks for the well-prepared MPs, Renan! I sponsored your work into the
P/O/N/J queues for SRU review.
--
You received this bug notification because you are a member of
Debcrafters packages, which is subscribed to open-iscsi in Ubuntu.
https://bugs.launchpad.net/bugs/2098515
Title:
IPv6-only (single stack) instances configuring network over dhcp in
initramfs will take a long time to boot due to loop in dhcpcd -4
Status in open-iscsi package in Ubuntu:
Fix Released
Status in open-iscsi source package in Jammy:
In Progress
Status in open-iscsi source package in Noble:
In Progress
Status in open-iscsi source package in Oracular:
In Progress
Status in open-iscsi source package in Plucky:
In Progress
Bug description:
[ Impact ]
Oracle Cloud provides users with baremetal instances, and two types of VM instances (native and paravirtualized). Native VMs and baremetal use ISCSI, while the paravirtualized VMs don't.
Oracle requires a single image which can run in all instance types, so it's not possible to provide an image with ISCSI enabled only for the instances that boot from it. Our images set ISCSI_AUTO to be compatible with those. Additionally, clouds generally don't specify command line args at boot so they can't simply enable or disable ISCSI on a per instance basis.
Oracle now has IPv6-only instances. On fully virtualized instances
there is no IP configuration coming from ibft, and
configure_networking() is trying to get network information through
DHCP in initramfs, but starting with IPv4. That generates a
significant delay (up to 5 minutes) when booting. Even the IPv6
address the instance gets is not useful, as the network can be
configured later through cloud-init.
The fix here skips configure_networking(), delegating it to
cloud-init, and speeding up the boot process on Oracle Cloud instances.
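As a rough illustration of the decision described above, here is a minimal POSIX shell sketch. It is not the patch that was uploaded: the function names cmdline_has_flag and should_skip_networking, the argument-passing style, and the use of "the iBFT directory does not exist" as the test for "no ibft data" are all assumptions made for this example. Only the flag name, ISCSI_AUTO and configure_networking() come from this report.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the code shipped in open-iscsi.

# Return 0 if flag $1 appears as a whole word in the cmdline string $2.
cmdline_has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# Return 0 (skip networking) only when all three conditions hold:
#   1. the opt-in flag is on the kernel command line ($1),
#   2. ISCSI_AUTO is set to a non-empty value ($2),
#   3. no iBFT directory exists at $3 (assumed proxy for "no ibft data").
should_skip_networking() {
    cmdline_has_flag iscsi_auto_skip_initramfs_networking "$1" || return 1
    [ -n "$2" ] || return 1
    [ ! -d "$3" ] || return 1
    return 0
}

# Possible call site (illustrative only; the echoed message is made up):
#   if should_skip_networking "$(cat /proc/cmdline)" "$ISCSI_AUTO" /sys/firmware/ibft; then
#       echo "skipping initramfs networking"
#   else
#       configure_networking
#   fi
```

Because the function returns non-zero unless every condition is met, a missing or misspelled flag falls through to the existing behaviour, which matches the opt-in intent described in this report.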
[ Test Plan ]
Thanks to Alec Warren <alecwarren19@xxxxxxxxx> for the detailed test plan.
1. Maintains current behaviour by default when cmdline arg is NOT set
a. Test setup:
- Ubuntu image (which uses ISCSI_AUTO mode) containing this change
- New cmdline arg "iscsi_auto_skip_initramfs_networking" NOT set
- Instance configurations:
- non-ISCSI instance on Oracle Cloud (paravirtualized VM)
- ISCSI instance on Oracle Cloud (native VM and Baremetal instance)
b. Test Assertions:
- Verified that the change does nothing and maintains current behavior
- The echo call is NOT in the serial console logs during initramfs stage
- Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
- Verifiable via cloud-init logs (states that network is configured and does not need to setup ephemeral network)
- Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs)
2. Does not break ISCSI use case on ISCSI instances when enabled via cmdline arg
a. Test setup:
- Ubuntu image (which uses ISCSI_AUTO mode) containing this change
- New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
- Instance configuration:
- ISCSI instance on Oracle Cloud (native VM and Baremetal instance)
b. Test Assertions:
- The echo call is NOT in the serial console logs during initramfs stage
- Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
- Verifiable via cloud-init logs (states that network is configured and does not need to setup ephemeral network)
- Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs)
3. Skips configuring networking on non-ISCSI instances when enabled via cmdline arg
a. Test setup:
- Ubuntu image (which uses ISCSI_AUTO mode) containing this change
- New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
- Instance configuration:
- non-ISCSI instance on Oracle Cloud (paravirtualized VM)
b. Test Assertions:
- The echo call IS present in the serial console logs during initramfs stage
- Instance does NOT have networking configured during initramfs and ephemeral networking IS needed and setup by cloud-init
- Verifiable via cloud-init logs (states that there is no networking from initramfs and sets up ephemeral network itself)
- Verifiable by no /run/net-* files existing (these would be created by configure_networking in initramfs)
- Boot speed is measurably faster than normal (~10-12s instead of the normal 20s+)
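The "/run/net-* files" assertion used in the three test cases above could be checked with a small helper such as the one below. This is only a sketch: the function name has_initramfs_net_files is made up for this example, and it checks nothing beyond the existence of files matching net-* in the given directory.

```shell
#!/bin/sh
# Hypothetical helper for the test plan; not part of any package.

# Return 0 if at least one net-* file exists in directory $1.
has_initramfs_net_files() {
    for f in "$1"/net-*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

if has_initramfs_net_files /run; then
    echo "initramfs networking was configured"
else
    echo "no /run/net-* files: initramfs networking was skipped or failed"
fi
```

Test cases 1 and 2 expect the first message, while test case 3 expects the second.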
[ Where problems could occur ]
Because this change targets a bug in a specific scenario, the check applies only to instances where the flag is present and ISCSI_AUTO is set, but there is no ibft data in the system. A mistake in the logic would make this change take effect in other scenarios, which is not the goal of this fix.
Any mistake in trying to make this configuration completely opt-in
would break existing instances in the sense that
configure_networking() may not run when it should. To avoid that we
explicitly check for the flag, and don't act if it is not set. The
expected behavior can be verified using the test steps above.
Usage-wise, if the flag is set incorrectly, the worst
that can happen is that the code won't detect it, the bug
still triggers, and users experience longer boot times,
just as happens now without the change.
[ Other Info ]
As explained above, there is a requirement from Oracle Cloud that makes it impossible to just unset ISCSI configuration on the images when spinning up non-ISCSI instances. This is the reason an opt-in flag is used to opt out of the network configuration. We know it may not be ideal, but this enables our cloud teams to set the flag on Oracle Cloud images without harming other users, who simply won't use it.
This changeset has been forwarded to Debian, but on their side there
were some questions and suggestions to improve the approach taken. If
Debian ends up changing the way this situation is handled, we may
change it in the development release to eliminate, or at least reduce,
the delta which was introduced. However, no new SRUs should happen on
this matter, as this change is considered maintainable for the
foreseeable future.
[ Original Description ]
Cloud instances that configure networking over DHCP in initramfs go through a "for ROUNDTTT in 30 60 90 120" loop inside configure_networking().
If the DHCP server offers only IPv6 (no IPv4), the instance
will take more than 5 minutes to boot: it first goes
through the loop trying to obtain an IPv4 address (dhcpcd -1KL -t $ROUNDTTT -4
${DEVICE:+"${DEVICE}"}) for 30+60+90+120 seconds (300 seconds, or
5 minutes, in total). Every attempt times out, and only then does the
boot process resume.
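The arithmetic behind the 5-minute figure can be reproduced with a few lines of shell. The helper name total_dhcp_wait is made up for this example; the round values are the ones quoted from the loop above.

```shell
#!/bin/sh
# Sum the per-round dhcpcd timeouts (in seconds) passed as arguments.
total_dhcp_wait() {
    total=0
    for round in "$@"; do
        total=$((total + round))
    done
    echo "$total"
}

total_dhcp_wait 30 60 90 120   # prints 300
```

This matches the console log in this report, where the iSCSI transport class is loaded at roughly 303 seconds into the boot.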
In https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/2091904
initramfs-tools improved this situation, looking
for IPv6 information in /sys/firmware/ibft/ethernet*/ip-addr to decide
whether to look for IPv6 or IPv4, however that assumes that IP
information will be available through ibft, which is not always true.
If no IP information is available through ibft, we still go through
this incorrect loop, delaying the boot process.
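To make the ibft dependency concrete, the sketch below guesses the address family from ip-addr files and reports "unknown" when none are found, which is the case this bug is about. It is not the initramfs-tools implementation: the function name ibft_ip_family and the "contains a colon means IPv6" heuristic are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the initramfs-tools code.

# Print "ipv6", "ipv4" or "unknown" based on the ethernet*/ip-addr files
# under the iBFT directory $1 (normally /sys/firmware/ibft).
ibft_ip_family() {
    for addr_file in "$1"/ethernet*/ip-addr; do
        # An unmatched glob stays literal and is skipped here.
        [ -r "$addr_file" ] || continue
        # Accept a final line without a trailing newline as well.
        read -r addr < "$addr_file" || [ -n "$addr" ] || continue
        case "$addr" in
            *:*) echo ipv6; return 0 ;;
            *.*) echo ipv4; return 0 ;;
        esac
    done
    echo unknown
}

ibft_ip_family /sys/firmware/ibft
```

On an instance with no iBFT, as in the log below, the glob matches nothing and the result is "unknown", so nothing tells the initramfs to prefer IPv6.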
Example from an instance booting through virtual disks, with no ibft,
and IPv6-only on Oracle Cloud:
```
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.0-1001-oracle root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0 nvme.shutdown_timeout=10 libiscsi.debug_libiscsi_eh=1 crash_kexec_post_notifiers
[...]
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... [ 2.863248] No iBFT detected.
Could not setup fw entries.
Begin: Waiting up to 180 secs for any network device to become available ... done.
dhcpcd-10.1.0 starting
dev: loaded udev
[ 2.906793] 8021q: 802.1Q VLAN Support v1.8
[ 2.917496] 8021q: adding VLAN 0 to HW filter on device enp0s5
DUID 00:03:00:01:02:00:17:36:95:6d
enp0s5: IAID 17:36:95:6d
enp0s5: carrier acquired
enp0s5: IAID 17:36:95:6d
[ 2.983134] workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
enp0s5: soliciting a DHCP lease
timed out
exiting due to oneshot
dhcpcd exited
Sleeping 0 seconds before retrying getting a DHCP lease
dhcpcd-10.1.0 starting
dev: loaded udev
DUID 00:03:00:01:02:00:17:36:95:6d
enp0s5: IAID 17:36:95:6d
enp0s5: soliciting a DHCP lease
timed out
exiting due to oneshot
dhcpcd exited
Sleeping 0 seconds before retrying getting a DHCP lease
dhcpcd-10.1.0 starting
dev: loaded udev
DUID 00:03:00:01:02:00:17:36:95:6d
enp0s5: IAID 17:36:95:6d
enp0s5: soliciting a DHCP lease
timed out
exiting due to oneshot
dhcpcd exited
Sleeping 0 seconds before retrying getting a DHCP lease
dhcpcd-10.1.0 starting
dev: loaded udev
DUID 00:03:00:01:02:00:17:36:95:6d
enp0s5: IAID 17:36:95:6d
enp0s5: soliciting a DHCP lease
timed out
exiting due to oneshot
dhcpcd exited
Sleeping 0 seconds before retrying getting a DHCP lease
no search or nameservers found in /run/net-.conf /run/net-*.conf /run/net6-*.conf
[ 303.057039] Loading iSCSI transport class v2.0-870.
[ 303.069113] iscsi: registered transport (tcp)
Could not get boot entry.
done.
```
Full log: https://pastebin.ubuntu.com/p/Sk5dcvpPyY/
The loop is visible between lines 1136 and 1176 of the full log.