yahoo-eng-team team mailing list archive
Message #90440
[Bug 1844191] Re: azure advanced networking sometimes triggers duplicate mac detection
** Description changed:
+ === Begin SRU Template ===
+ [Impact]
+ When accelerated networking is enabled on Azure, the host presents two network interfaces with the same MAC address to the VM:
+ a synthetic NIC (netvsc) and a VF NIC, which is enslaved to the synthetic NIC.
+
+ The net module already excludes slave NICs when enumerating
+ interfaces. However, if cloud-init starts enumerating after the kernel
+ makes the VF visible to userspace, but before the enslaving has
+ finished, cloud-init will see two NICs with the same MAC address.
+
+ [Test Case]
+ Launch an instance with accelerated networking and ensure the instance comes up as expected with no networking-related Tracebacks in /var/log/cloud-init.log
+
+ [Regression Potential]
+ This is already in error handling code and is scoped to a particular driver. A regression here would mean we could allow a cloud-init instance to come up with duplicate macs when we otherwise wouldn't.
+
+ [Other info]
+ The cloud-init team attempted to reproduce this bug but could not. It was reported as being seen in "1 in 1000" launches.
+
+ Github PR: https://github.com/canonical/cloud-init/pull/1853
+
+ === End SRU Template ===
+
+ Initial bug:
+
Hi, we're still being affected by this on Azure with
19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image:
BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS
Here is the packer config:
````
- "provisioners": [
- {
- "type": "shell",
- "inline": [
- "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
- ]
- },
- {
- "type": "ansible",
- "playbook_file": "{{user `ansible_playbook`}}",
- "user": "packer",
- "extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
- },
- {
- "type": "shell",
- "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
- "inline_shebang": "/bin/sh -x",
- "inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
- }]
+ "provisioners": [
+ {
+ "type": "shell",
+ "inline": [
+ "while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
+ ]
+ },
+ {
+ "type": "ansible",
+ "playbook_file": "{{user `ansible_playbook`}}",
+ "user": "packer",
+ "extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
+ },
+ {
+ "type": "shell",
+ "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
+ "inline_shebang": "/bin/sh -x",
+ "inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
+ }]
````
Here is the playbook:
````
---
- hosts: all
- remote_user: ubuntu
- become: yes
- become_method: sudo
- become_user: root
+ remote_user: ubuntu
+ become: yes
+ become_method: sudo
+ become_user: root
- environment:
- DEBIAN_FRONTEND: noninteractive
+ environment:
+ DEBIAN_FRONTEND: noninteractive
````
Note: we are applying `enableAcceleratedNetworking: true` to the NIC;
anecdotally, we think this is related.
Usually our playbook has more in it (obviously) but Azure kept pointing
fingers at us that our image was causing the problem, so I ran this test
simply deploying a blank deprovisioned image via our same process.
And here's what happens on the serial console log:
````
[ 20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
[ 20.343177] sh[910]: + echo cleaning persistent cloud-init object
[ 20.349027] [ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Synchronized.
sh[910]: cleaning persistent cloud-init object
[ 20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
[ 20.412333] sh[910]: + exit 0
[ 34.282291] cloud-init[938]: Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 18:02:23 +0000. Up 32.02 seconds.
[ 34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: failed stage init-local
[ 34.423057] cloud-init[938]: failed run of stage init-local
[ 34.437716] cloud-init[938]: ------------------------------------------------------------
[ 34.441088] cloud-init[938]: Traceback (most recent call last):
[ 34.443719] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
[ 34.448072] cloud-init[938]: ret = functor(name, args)
[ 34.450532] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
[ 34.454849] cloud-init[938]: init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
[ 34.458725] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in apply_network_config
[ 34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
[ 34.466051] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
[ 34.470673] cloud-init[938]: present_macs = get_interfaces_by_mac().keys()
[ 34.473964] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
[ 34.479325] cloud-init[938]: (name, ret[mac], mac))
[ 34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
[ 34.486614] cloud-init[938]: ------------------------------------------------------------
[FAILED] Failed to start Initial cloud-init job (pre-networking).
See 'systemctl status cloud-init-local.service' for details.
[ OK ] Reached target Network (Pre).
- Starting Network Service...
+ Starting Network Service...
[ OK ] Started Network Service.
- Starting Wait for Network to be Configured...
- Starting Network Name Resolution...
+ Starting Wait for Network to be Configured...
+ Starting Network Name Resolution...
[ OK ] Started Wait for Network to be Configured.
- Starting Initial cloud-init job (metadata service crawler)...
+ Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
````
When this happens, the machine never boots, and we get an
OSProvisioningTimedOut error after about 30 minutes, and the machine
never reaches healthy state.
** Also affects: cloud-init (Ubuntu)
Importance: Undecided
Status: New
** Also affects: cloud-init (Ubuntu Focal)
Importance: Undecided
Status: New
** Also affects: cloud-init (Ubuntu Bionic)
Importance: Undecided
Status: New
** Also affects: cloud-init (Ubuntu Kinetic)
Importance: Undecided
Status: New
** Also affects: cloud-init (Ubuntu Jammy)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/1844191
Title:
azure advanced networking sometimes triggers duplicate mac detection
Status in cloud-init:
Fix Released
Status in cloud-init package in Ubuntu:
New
Status in cloud-init source package in Bionic:
New
Status in cloud-init source package in Focal:
New
Status in cloud-init source package in Jammy:
New
Status in cloud-init source package in Kinetic:
New
Bug description:
=== Begin SRU Template ===
[Impact]
When accelerated networking is enabled on Azure, the host presents two network interfaces with the same MAC address to the VM:
a synthetic NIC (netvsc) and a VF NIC, which is enslaved to the synthetic NIC.
The net module already excludes slave NICs when enumerating
interfaces. However, if cloud-init starts enumerating after the kernel
makes the VF visible to userspace, but before the enslaving has
finished, cloud-init will see two NICs with the same MAC address.
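The race described above can be illustrated with a minimal sketch. This is not the code from the linked PR; it only shows one way a MAC-to-name map could tolerate a duplicate when one of the two interfaces uses the `hv_netvsc` driver. The function name `interfaces_by_mac`, the tuple layout, and the `mlx5_core` VF driver in the usage note are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# Driver name of the synthetic Hyper-V NIC.
NETVSC_DRIVER = "hv_netvsc"


def interfaces_by_mac(nics: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map MAC address -> interface name from (name, mac, driver) tuples.

    Hypothetical helper: a duplicate MAC is tolerated only when exactly one
    of the two interfaces is the synthetic netvsc NIC, in which case the
    synthetic NIC wins and the not-yet-enslaved VF is ignored.
    """
    names: Dict[str, str] = {}
    drivers: Dict[str, str] = {}
    for name, mac, driver in nics:
        if mac not in names:
            names[mac] = name
            drivers[mac] = driver
            continue
        if drivers[mac] == NETVSC_DRIVER and driver != NETVSC_DRIVER:
            # Synthetic NIC already recorded: skip the VF.
            continue
        if driver == NETVSC_DRIVER and drivers[mac] != NETVSC_DRIVER:
            # VF was seen first: replace it with the synthetic NIC.
            names[mac] = name
            drivers[mac] = driver
            continue
        # Any other duplicate is still treated as an error.
        raise RuntimeError(
            "duplicate mac found! both '%s' and '%s' have mac '%s'"
            % (name, names[mac], mac)
        )
    return names
```

For example, passing `("eth0", mac, "hv_netvsc")` and `("enP1s1", mac, "mlx5_core")` in either order yields a map containing only `eth0`, while two non-netvsc interfaces sharing a MAC still raise `RuntimeError`.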
[Test Case]
Launch an instance with accelerated networking and ensure the instance comes up as expected with no networking-related Tracebacks in /var/log/cloud-init.log
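The log check in this test case could be automated along the following lines. This is a sketch under assumptions: the helper name `find_network_failures` is hypothetical, and the regular expression is modelled only on the `RuntimeError` message quoted in the serial console log of this report.

```python
import re
from typing import List, Tuple

# Pattern modelled on the error message quoted in this bug report.
DUPLICATE_MAC_RE = re.compile(
    r"duplicate mac found! both '([^']+)' and '([^']+)' have mac '([0-9a-fA-F:]+)'"
)


def find_network_failures(log_text: str) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Return (traceback count, [(nic_a, nic_b, mac), ...]) for a log's text."""
    traceback_count = sum(
        1
        for line in log_text.splitlines()
        if "Traceback (most recent call last)" in line
    )
    duplicates = DUPLICATE_MAC_RE.findall(log_text)
    return traceback_count, duplicates
```

A caller would read /var/log/cloud-init.log into a string, pass it to the helper, and treat a non-zero traceback count or a non-empty duplicate list as a failed run.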
[Regression Potential]
This is already in error handling code and is scoped to a particular driver. A regression here would mean we could allow a cloud-init instance to come up with duplicate macs when we otherwise wouldn't.
[Other info]
The cloud-init team attempted to reproduce this bug but could not. It was reported as being seen in "1 in 1000" launches.
Github PR: https://github.com/canonical/cloud-init/pull/1853
=== End SRU Template ===
Initial bug:
Hi, we're still being affected by this on Azure with
19.2-24-ge7881d5c-0ubuntu1~18.04.1 - using PACKER to build from image:
BuildSource : Marketplace/Canonical/UbuntuServer/18.04-DAILY-LTS
Here is the packer config:
````
"provisioners": [
{
"type": "shell",
"inline": [
"while [ ! -f /var/lib/cloud/instance/boot-finished ]; do echo 'Waiting for cloud-init...'; sleep 1; done"
]
},
{
"type": "ansible",
"playbook_file": "{{user `ansible_playbook`}}",
"user": "packer",
"extra_arguments": [ "--extra-vars", "codeVersion={{user `code_version`}} managed_image_name={{user `managed_image_name`}}" ]
},
{
"type": "shell",
"execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
"inline_shebang": "/bin/sh -x",
"inline": [ "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync" ]
}]
````
Here is the playbook:
````
---
- hosts: all
remote_user: ubuntu
become: yes
become_method: sudo
become_user: root
environment:
DEBIAN_FRONTEND: noninteractive
````
Note: we are applying `enableAcceleratedNetworking: true` to the NIC;
anecdotally, we think this is related.
Usually our playbook has more in it (obviously) but Azure kept
pointing fingers at us that our image was causing the problem, so I
ran this test simply deploying a blank deprovisioned image via our
same process.
And here's what happens on the serial console log:
````
[ 20.337603] sh[910]: + [ -e /var/lib/cloud/instance/obj.pkl ]
[ 20.343177] sh[910]: + echo cleaning persistent cloud-init object
[ 20.349027] [ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Synchronized.
sh[910]: cleaning persistent cloud-init object
[ 20.361066] sh[910]: + rm /var/lib/cloud/instance/obj.pkl
[ 20.412333] sh[910]: + exit 0
[ 34.282291] cloud-init[938]: Cloud-init v. 19.2-24-ge7881d5c-0ubuntu1~18.04.1 running 'init-local' at Mon, 16 Sep 2019 18:02:23 +0000. Up 32.02 seconds.
[ 34.288809] cloud-init[938]: 2019-09-16 18:02:25,262 - util.py[WARNING]: failed stage init-local
[ 34.423057] cloud-init[938]: failed run of stage init-local
[ 34.437716] cloud-init[938]: ------------------------------------------------------------
[ 34.441088] cloud-init[938]: Traceback (most recent call last):
[ 34.443719] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 653, in status_wrapper
[ 34.448072] cloud-init[938]: ret = functor(name, args)
[ 34.450532] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 362, in main_init
[ 34.454849] cloud-init[938]: init.apply_network_config(bring_up=bool(mode != sources.DSMODE_LOCAL))
[ 34.458725] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 697, in apply_network_config
[ 34.463421] cloud-init[938]: net.wait_for_physdevs(netcfg)
[ 34.466051] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 344, in wait_for_physdevs
[ 34.470673] cloud-init[938]: present_macs = get_interfaces_by_mac().keys()
[ 34.473964] cloud-init[938]: File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 633, in get_interfaces_by_mac
[ 34.479325] cloud-init[938]: (name, ret[mac], mac))
[ 34.481838] cloud-init[938]: RuntimeError: duplicate mac found! both 'eth0' and 'enP1s1' have mac '00:0d:3a:7c:f7:3f'
[ 34.486614] cloud-init[938]: ------------------------------------------------------------
[FAILED] Failed to start Initial cloud-init job (pre-networking).
See 'systemctl status cloud-init-local.service' for details.
[ OK ] Reached target Network (Pre).
Starting Network Service...
[ OK ] Started Network Service.
Starting Wait for Network to be Configured...
Starting Network Name Resolution...
[ OK ] Started Wait for Network to be Configured.
Starting Initial cloud-init job (metadata service crawler)...
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
[ OK ] Reached target Network.
````
When this happens, the machine never boots, and we get an
OSProvisioningTimedOut error after about 30 minutes, and the machine
never reaches healthy state.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1844191/+subscriptions