yahoo-eng-team team mailing list archive

[Bug 1936972] [NEW] MAAS deploys fail if host has NIC w/ random MAC

 

Public bug reported:

The Nvidia DGX A100 server includes a USB Redfish Host Interface NIC.
This NIC apparently provides no MAC address of its own, so the driver
generates a random MAC for it:

./drivers/net/usb/cdc_ether.c:

static int usbnet_cdc_zte_bind(struct usbnet *dev, struct usb_interface *intf)
{
        int status = usbnet_cdc_bind(dev, intf);

        if (!status && (dev->net->dev_addr[0] & 0x02))
                eth_hw_addr_random(dev->net);

        return status;
}
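
The 0x02 test in the driver looks at the locally administered bit of the
first octet of the MAC. A minimal sketch of the same check in Python
follows; the function name is hypothetical. Note that this bit alone
cannot prove a MAC was randomly generated, since VMs and administrators
set it too.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered bit (0x02) is set.

    The bit lives in the first octet of the MAC address, which is the
    same byte the driver inspects via dev_addr[0].
    """
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)
```

The MAC from the traceback in this report, fe:b8:63:69:9f:71, has this
bit set.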

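This causes a problem with MAAS because, during deployment, MAAS sees
this as a normal NIC and records the MAC. Since the driver generates a
new random MAC on each boot, the recorded MAC no longer matches any
device after the post-install reboot, and cloud-init fails: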

[   43.652573] cloud-init[3761]:     init.apply_network_config(bring_up=not args.local)
[   43.700516] cloud-init[3761]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 735, in apply_network_config
[   43.724496] cloud-init[3761]:     self.distro.networking.wait_for_physdevs(netcfg)
[   43.740509] cloud-init[3761]:   File "/usr/lib/python3/dist-packages/cloudinit/distros/networking.py", line 177, in wait_for_physdevs
[   43.764523] cloud-init[3761]:     raise RuntimeError(msg)
[   43.780511] cloud-init[3761]: RuntimeError: Not all expected physical devices present: {'fe:b8:63:69:9f:71'}
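
The failure mode can be modelled roughly as a set difference between the
MACs recorded at deployment time and the MACs present on the current
boot. This is a simplified sketch for illustration, not cloud-init's
actual implementation; the function name and parameters are
hypothetical.

```python
def check_physdevs(expected_macs, present_macs):
    """Raise if any MAC from the network config is absent on this boot.

    expected_macs: MACs recorded in the rendered network config.
    present_macs:  MACs of the interfaces the kernel currently exposes.
    """
    expected = {mac.lower() for mac in expected_macs}
    present = {mac.lower() for mac in present_macs}
    missing = expected - present
    if missing:
        raise RuntimeError(
            "Not all expected physical devices present: %s" % missing
        )
```

A NIC whose MAC changes between boots will always end up in the missing
set, no matter how long the wait is.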

I'm not sure what the best answer for MAAS is here, but here are some
thoughts:

1) Ignore all Redfish system interfaces. These are a connection between
the host and the BMC, so they don't really have a use case in the MAAS
model AFAICT. These devices can be identified using the SMBIOS data
described in the Redfish Host Interface Specification, section 8:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP0270_1.3.0.pdf
That data can be read from within Linux using dmidecode.
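
A rough sketch of how option 1 could start, assuming the Redfish Host
Interface is described by an SMBIOS type 42 record (Management
Controller Host Interface). The helper names are hypothetical, and
dmidecode normally needs root. A type 42 record does not by itself
prove the interface speaks Redfish, and mapping the record to a Linux
netdev (e.g. via USB vendor/product IDs) is not shown here.

```python
import re
import subprocess

# dmidecode starts every record with a line such as:
#   Handle 0x0047, DMI type 42, 16 bytes
_HANDLE_RE = re.compile(
    r"^Handle 0x[0-9A-Fa-f]+, DMI type (\d+), \d+ bytes", re.MULTILINE
)

# Assumption: type 42 is the Management Controller Host Interface record.
MGMT_CONTROLLER_HOST_INTERFACE = 42


def count_smbios_records(dmidecode_output,
                         smbios_type=MGMT_CONTROLLER_HOST_INTERFACE):
    """Count the records of the given SMBIOS type in dmidecode output."""
    return sum(
        1
        for match in _HANDLE_RE.finditer(dmidecode_output)
        if int(match.group(1)) == smbios_type
    )


def has_host_interface_record():
    """Run dmidecode and report whether any type 42 record exists."""
    result = subprocess.run(
        ["dmidecode", "-t", str(MGMT_CONTROLLER_HOST_INTERFACE)],
        check=True,
        capture_output=True,
        text=True,
    )
    return count_smbios_records(result.stdout) > 0
```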

2) Ignore (or specially handle) all NICs with randomly generated MAC
addresses. While this is the only time I've seen a random MAC on
production server hardware, it is something I've seen on e.g. ARM
development boards. Problem is, I don't know how to detect a generated
MAC. I'd hoped the permanent MAC (ethtool -P) would be NULL, but it
seems to also be set to the generated MAC :(
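
One possible way to detect a generated MAC for option 2 is the kernel's
addr_assign_type attribute under /sys/class/net/<iface>/, where a value
of 1 means the address was randomly generated. eth_hw_addr_random()
should set this, but it has not been verified on the DGX A100, so treat
the following as a sketch. The function names are hypothetical.

```python
from pathlib import Path

# Values documented for /sys/class/net/<iface>/addr_assign_type
NET_ADDR_PERM = 0    # permanent address
NET_ADDR_RANDOM = 1  # randomly generated address
NET_ADDR_STOLEN = 2  # address taken from another device
NET_ADDR_SET = 3     # address set by userspace


def addr_assign_type(iface, sysfs_net="/sys/class/net"):
    """Return the kernel's address assignment type for an interface."""
    path = Path(sysfs_net, iface, "addr_assign_type")
    return int(path.read_text().strip())


def has_random_mac(iface, sysfs_net="/sys/class/net"):
    """Return True if the kernel reports the MAC as randomly generated."""
    return addr_assign_type(iface, sysfs_net) == NET_ADDR_RANDOM
```

The sysfs_net parameter exists only so the helpers can be pointed at a
test directory instead of the real sysfs tree.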

FYI, two workarounds for this seem to work:
 1) Delete the NIC from the MAAS model in the MAAS UI after every commissioning.
 2) Use a tag's kernel_opts field to modprobe.blacklist the driver used for the Redfish NIC.
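
For workaround 2, the value placed in the tag's kernel_opts field is a
modprobe.blacklist= kernel command-line option with a comma-separated
module list. A small helper to build it is sketched below. The helper
name is hypothetical, and cdc_ether is only an assumption based on the
source file quoted earlier; check which driver actually binds the NIC
(e.g. with ethtool -i) before blacklisting anything.

```python
def blacklist_kernel_opt(*modules: str) -> str:
    """Build a modprobe.blacklist= kernel option for the given modules."""
    if not modules:
        raise ValueError("at least one module name is required")
    return "modprobe.blacklist=" + ",".join(modules)


# Assumed driver name; confirm it on the actual machine first.
KERNEL_OPTS = blacklist_kernel_opt("cdc_ether")
```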

** Affects: cloud-init
     Importance: Undecided
         Status: New

** Affects: curtin
     Importance: Undecided
         Status: New

** Affects: maas
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1936972

Title:
  MAAS deploys fail if host has NIC w/ random MAC

Status in cloud-init:
  New
Status in curtin:
  New
Status in MAAS:
  New
