debcrafters-packages team mailing list archive

[Bug 2098515] Re: IPv6-only (single stack) instances configuring network over dhcp in initramfs will take a long time to boot due to loop in dhcpcd -4

 

Thanks for the well-prepared MPs, Renan! I sponsored your work into the
P/O/N/J queues for SRU review.

-- 
You received this bug notification because you are a member of
Debcrafters packages, which is subscribed to open-iscsi in Ubuntu.
https://bugs.launchpad.net/bugs/2098515

Title:
  IPv6-only (single stack) instances configuring network over dhcp in
  initramfs will take a long time to boot due to loop in dhcpcd -4

Status in open-iscsi package in Ubuntu:
  Fix Released
Status in open-iscsi source package in Jammy:
  In Progress
Status in open-iscsi source package in Noble:
  In Progress
Status in open-iscsi source package in Oracular:
  In Progress
Status in open-iscsi source package in Plucky:
  In Progress

Bug description:
  [ Impact ]
  Oracle Cloud provides users with baremetal instances, and two types of VM instances (native and paravirtualized). Native VMs and baremetal use ISCSI, while the paravirtualized VMs don't.
  Oracle requires a single image which can run in all instance types, so it's not possible to provide an image with ISCSI enabled only for the instances that boot from it. Our images set ISCSI_AUTO to be compatible with those. Additionally, clouds generally don't specify command line args at boot so they can't simply enable or disable ISCSI on a per instance basis.

  Oracle now has IPv6-only instances. On fully virtualized instances
  there is no IP configuration coming from ibft, so
  configure_networking() tries to obtain network information through
  DHCP in the initramfs, starting with IPv4. That causes a
  significant delay (up to 5 minutes) at boot. Even the IPv6
  address the instance eventually gets is not useful, as the network
  can be configured later through cloud-init.

  The fix here skips configure_networking(), delegating it to cloud-
  init, and speeding up the boot process on Oracle Cloud instances.
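
  The conditions under which the skip applies can be sketched as below.
  This is a minimal illustration, not the actual patch: the helper name
  should_skip_initramfs_networking and its three arguments are
  hypothetical, and testing for an iBFT directory is only an assumed
  stand-in for "no ibft data in the system".

  ```shell
  # Hypothetical helper, not the code from the MPs.
  # $1: kernel command line   $2: value of ISCSI_AUTO   $3: iBFT sysfs directory
  # Returns 0 (skip configure_networking) only when all three conditions hold.
  should_skip_initramfs_networking() {
      cmdline=$1
      iscsi_auto=$2
      ibft_dir=$3

      # Opt-in: the flag must appear as its own word on the command line.
      case " $cmdline " in
          *" iscsi_auto_skip_initramfs_networking "*) ;;
          *) return 1 ;;
      esac

      # ISCSI_AUTO must be set in the image.
      [ -n "$iscsi_auto" ] || return 1

      # Assumption: a missing directory means the firmware exposed no iBFT data.
      [ ! -d "$ibft_dir" ] || return 1

      return 0
  }

  # Possible use in an initramfs script (illustrative only):
  #   if should_skip_initramfs_networking "$(cat /proc/cmdline)" \
  #           "${ISCSI_AUTO:-}" /sys/firmware/ibft; then
  #       echo "Skipping configure_networking"
  #   fi
  ```

  Failing any single condition falls back to the existing behaviour,
  which keeps the change opt-in.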

  [ Test Plan ]
  Thanks to Alec Warren <alecwarren19@xxxxxxxxx> for the detailed test plan.

  1. Maintains current behaviour by default when cmdline arg is NOT set
    a. Test setup:
      - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
      - New cmdline arg "iscsi_auto_skip_initramfs_networking" NOT set
      - Instance configurations:
        - non-ISCSI instance on Oracle Cloud (paravirtualized VM)
        - ISCSI instance on Oracle Cloud (native VM and Baremetal instance)

    b. Test Assertions:
      - Verified that the change does nothing and maintains current behavior
      - The echo call is NOT in the serial console logs during initramfs stage
      - Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
        - Verifiable via cloud-init logs (states that network is configured and does not need to setup ephemeral network)
        - Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs)

  2. Does not break ISCSI use case on ISCSI instances when enabled via cmdline arg
    a. Test setup:
      - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
      - New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
      - Instance configuration:
        - ISCSI instance on Oracle Cloud (native VM and Baremetal instance)

    b. Test Assertions:
      - The echo call is NOT in the serial console logs during initramfs stage
      - Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
        - Verifiable via cloud-init logs (states that network is configured and does not need to setup ephemeral network)
        - Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs)

  3. Skips configuring networking on non-ISCSI instances when enabled via cmdline arg
    a. Test setup:
      - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
      - New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
      - Instance configuration:
        - non-ISCSI instance on Oracle Cloud (paravirtualized VM)

    b. Test Assertions:
      - The echo call IS present in the serial console logs during initramfs stage
      - Instance does NOT have networking configured during initramfs and ephemeral networking IS needed and setup by cloud-init
        - Verifiable via cloud-init logs (states that there is no networking from initramfs and sets up ephemeral network itself)
        - Verifiable by no /run/net-* files existing (these would be created by configure_networking in initramfs)
      - Boot speed is measurably faster than normal (~10-12s instead of the normal 20s+)
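
  The /run/net-* assertion used in the three cases above could be
  checked with something like the following sketch. The helper name
  initramfs_networking_state is hypothetical, and the directory argument
  exists only so the check can be tried outside a real instance.

  ```shell
  # Hypothetical helper: prints "configured" if any net-* file exists in the
  # given directory (default /run), and "not configured" otherwise.
  initramfs_networking_state() {
      run_dir=${1:-/run}
      for f in "$run_dir"/net-*; do
          # An unmatched glob stays literal, so test that the path exists.
          if [ -e "$f" ]; then
              echo "configured"
              return 0
          fi
      done
      echo "not configured"
  }
  ```

  Cases 1 and 2 expect "configured"; case 3 expects "not configured".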

  [ Where problems could occur ]
  Because this change targets a bug in a specific scenario, the check explicitly applies only to instances where the flag is present and ISCSI_AUTO is set, but there is no ibft data in the system. A mistake in the logic would make this change take effect in other scenarios, which is not the goal of this fix.

  Any mistake in trying to make this configuration completely opt-in
  would break existing instances in the sense that
  configure_networking() may not run when it should. To avoid that we
  explicitly check for the flag, and don't act if it is not set. The
  expected behavior can be verified using the test steps above.

  Usage-wise, if the flag is set incorrectly, the worst that can
  happen is that the code won't detect it, the bug triggers, and
  users experience longer boot times, just as happens now without
  the change.

  [ Other Info ]
  As explained above, there is a requirement from Oracle Cloud that makes it impossible to just unset ISCSI configuration on the images when spinning up non-ISCSI instances. This is the reason an opt-in flag is used to opt out of the network configuration. We know it may not be ideal, but this enables our cloud teams to set the flag on Oracle Cloud images without harming other users, who simply don't use it.

  This changeset has been forwarded to Debian, but on their side there
  were some questions and suggestions to improve the approach taken. If
  Debian ends up changing the way this situation is handled, we may
  change it in the development release to eliminate, or at least reduce,
  the delta which was introduced. However, no new SRUs should happen on
  this matter, as this change is considered maintainable for the
  foreseeable future.

  [ Original Description ]
  Cloud instances that configure network over DHCP in initramfs will go through a "for ROUNDTTT in 30 60 90 120" loop inside configure_networking().

  If the DHCP server is only offering an IPv6 address (no IPv4), the
  instance will take more than 5 minutes to boot, because it will first
  go through a loop trying to obtain an IPv4 address (dhcpcd -1KL -t
  $ROUNDTTT -4 ${DEVICE:+"${DEVICE}"}) for 30+60+90+120 seconds (300
  seconds in total, or 5 minutes). Every attempt times out, and only
  then does the boot process resume.
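
  The worst-case wait follows directly from the timeouts in that loop.
  The snippet below only adds them up; it does not call dhcpcd, and the
  variable total is introduced here for illustration.

  ```shell
  # Sum the per-round dhcpcd timeouts used by the loop in configure_networking().
  total=0
  for ROUNDTTT in 30 60 90 120; do
      total=$((total + ROUNDTTT))
  done
  echo "Worst-case IPv4 wait: ${total}s"   # prints "Worst-case IPv4 wait: 300s"
  ```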

  In https://bugs.launchpad.net/ubuntu/+source/initramfs-
  tools/+bug/2091904 initramfs-tools improved this situation, looking
  for IPv6 information in /sys/firmware/ibft/ethernet*/ip-addr to decide
  whether to look for IPv6 or IPv4, however that assumes that IP
  information will be available through ibft, which is not always true.

  If no IP information is available through ibft, we still go through
  this incorrect loop, delaying the boot process.
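
  To illustrate the ibft-based decision described above, here is a rough
  sketch of how an address family could be derived from the ip-addr
  files. It is not the initramfs-tools implementation: the helper name
  ibft_ip_family is hypothetical, and treating any address containing
  ':' as IPv6 is an assumption made for this example.

  ```shell
  # Hypothetical helper: prints "6" if any ethernet*/ip-addr file under the
  # given iBFT directory holds an IPv6 address, "4" if only IPv4 addresses
  # are found, and an empty line if no address is available at all.
  ibft_ip_family() {
      ibft_dir=${1:-/sys/firmware/ibft}
      family=""
      for f in "$ibft_dir"/ethernet*/ip-addr; do
          [ -r "$f" ] || continue
          read -r addr < "$f" || true
          case "$addr" in
              *:*) family=6 ;;                       # assumed IPv6 marker
              ?*)  [ -n "$family" ] || family=4 ;;   # non-empty, no colon
          esac
      done
      echo "$family"
  }
  ```

  The empty result corresponds to the case reported in this bug: with no
  ibft data there is nothing to base the decision on.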

  Example from an instance booting through virtual disks, with no ibft,
  and IPv6-only on Oracle Cloud:

  ```
  [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.0-1001-oracle root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0 nvme.shutdown_timeout=10 libiscsi.debug_libiscsi_eh=1 crash_kexec_post_notifiers
  [...]
  Begin: Running /scripts/init-premount ... done.
  Begin: Mounting root file system ... Begin: Running /scripts/local-top ... [    2.863248] No iBFT detected.
  Could not setup fw entries.
  Begin: Waiting up to 180 secs for any network device to become available ... done.
  dhcpcd-10.1.0 starting
  dev: loaded udev
  [    2.906793] 8021q: 802.1Q VLAN Support v1.8
  [    2.917496] 8021q: adding VLAN 0 to HW filter on device enp0s5
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: carrier acquired
  enp0s5: IAID 17:36:95:6d
  [    2.983134] workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  no search or nameservers found in /run/net-.conf /run/net-*.conf /run/net6-*.conf
  [  303.057039] Loading iSCSI transport class v2.0-870.
  [  303.069113] iscsi: registered transport (tcp)
  Could not get boot entry.
  done.
  ```

  Full log: https://pastebin.ubuntu.com/p/Sk5dcvpPyY/

  The loop is visible between lines 1136 and 1176 of the full log.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/2098515/+subscriptions