yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #73754
[Bug 1781129] [NEW] linuxbridge-agent missed updated device sometimes
Public bug reported:
* Version:
master branch head as of 11 July 2018
https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055
* Summary:
When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.
---
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] raise exception.VirtualInterfaceCreateException()
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
---
* Reproducing:
Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step
---
let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.
0. Previous device info is like following
{'added': set(),
'current': set("tapA"),
'updated': set(),
'removed': set(),
'timestamps': {"tapA": 1}}
1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
-> current_devices is "tapA"
__ disappeared tapA due to rebuilding VM __
2. Get timestamp(interface index in the case of linuxbridge-agent) https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
-> current timestamp is {"tapA": None}. this is because we failed to get interface information here
3. Check device locally changed or not
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
-> locally "tapA" is detected as locally changed device. because timestamp information is change from before (1 != None)
4. Generate device_info like following
{'added': set(),
'current': set("tapA"),
'updated': set("tapA"),
'removed': set(),
'timestamps': {"tapA": None}}
5. Process linuxbridge-agent interface plugging logic for tapA, but
checking device existence failed because there is no such a device. here
note even if check for device existence failed, this function won't
raise exception and re-sync won't happen
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531
-- appeared tapA again due to rebuilding VM --
-- next scan_device iteration start --
6. Get current_devices
-> current_devices is "tapA"
7. Get timestamp
-> current timestamp is {"tapA":2}.
8. Check device locally changed or not
-> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369
9. Generate device_info like following
{'added': set(),
'current': set("tapA"),
'updated': set(),
'removed': set(),
'timestamps': {"tapA": 2}}
next iteration is expected to detect device updated but didn't in this case.
So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1781129
Title:
linuxbridge-agent missed updated device sometimes
Status in neutron:
New
Bug description:
* Version:
master branch head as of 11 July 2018
https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055
* Summary:
When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.
---
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] raise exception.VirtualInterfaceCreateException()
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
---
* Reproducing:
Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step
---
let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.
0. Previous device info is like following
{'added': set(),
'current': set("tapA"),
'updated': set(),
'removed': set(),
'timestamps': {"tapA": 1}}
1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
-> current_devices is "tapA"
__ disappeared tapA due to rebuilding VM __
2. Get timestamp(interface index in the case of linuxbridge-agent) https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
-> current timestamp is {"tapA": None}. this is because we failed to get interface information here
3. Check device locally changed or not
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
-> locally "tapA" is detected as locally changed device. because timestamp information is change from before (1 != None)
4. Generate device_info like following
{'added': set(),
'current': set("tapA"),
'updated': set("tapA"),
'removed': set(),
'timestamps': {"tapA": None}}
5. Process linuxbridge-agent interface plugging logic for tapA, but
checking device existence failed because there is no such a device.
here note even if check for device existence failed, this function
won't raise exception and re-sync won't happen
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531
-- appeared tapA again due to rebuilding VM --
-- next scan_device iteration start --
6. Get current_devices
-> current_devices is "tapA"
7. Get timestamp
-> current timestamp is {"tapA":2}.
8. Check device locally changed or not
-> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369
9. Generate device_info like following
{'added': set(),
'current': set("tapA"),
'updated': set(),
'removed': set(),
'timestamps': {"tapA": 2}}
next iteration is expected to detect device updated but didn't in this case.
So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1781129/+subscriptions
Follow ups