← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1781129] [NEW] linuxbridge-agent missed updated device sometimes

 

Public bug reported:

* Version: 
master branch head as of 11 July 2018
https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055

* Summary:
When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.  

---
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]     raise exception.VirtualInterfaceCreateException()
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
---

* Reproducing:
Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step

---
let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.  

0. Previous device info is like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set(),
                        'removed': set(),
                        'timestamps': {"tapA": 1}}

1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
->  current_devices is "tapA" 

__ disappeared tapA due to rebuilding VM __

2. Get timestamp(interface index in the case of linuxbridge-agent)  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
->  current timestamp is {"tapA": None}. this is because we failed to get interface information here

3. Check device locally changed or not
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
->  locally "tapA" is detected  as locally changed device. because timestamp information is change from before (1 != None)

4. Generate device_info like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set("tapA"),
                        'removed': set(),
                        'timestamps': {"tapA": None}}

5. Process linuxbridge-agent interface plugging logic for tapA, but
checking device existence failed because there is no such a device. here
note even if check for device existence failed, this function won't
raise exception and re-sync won't happen
https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531

-- appeared tapA again due to rebuilding VM --
-- next scan_device iteration start --

6. Get current_devices 
->  current_devices is "tapA" 

7. Get timestamp 
-> current timestamp is {"tapA":2}. 

8. Check device locally changed or not 
-> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369

9. Generate device_info like following
                       {'added': set(),
                        'current': set("tapA"),
                        'updated': set(),
                        'removed': set(),
                        'timestamps': {"tapA": 2}}

next iteration is expected to detect device updated but didn't in this case.
So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1781129

Title:
  linuxbridge-agent missed updated device sometimes

Status in neutron:
  New

Bug description:
  * Version: 
  master branch head as of 11 July 2018
  https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055

  * Summary:
  When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
  linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.  

  ---
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]     raise exception.VirtualInterfaceCreateException()
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
  ---

  * Reproducing:
  Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step

  ---
  let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.  

  0. Previous device info is like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set(),
                          'removed': set(),
                          'timestamps': {"tapA": 1}}

  1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
  ->  current_devices is "tapA" 

  __ disappeared tapA due to rebuilding VM __

  2. Get timestamp(interface index in the case of linuxbridge-agent)  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
  ->  current timestamp is {"tapA": None}. this is because we failed to get interface information here

  3. Check device locally changed or not
  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
  ->  locally "tapA" is detected  as locally changed device. because timestamp information is change from before (1 != None)

  4. Generate device_info like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set("tapA"),
                          'removed': set(),
                          'timestamps': {"tapA": None}}

  5. Process linuxbridge-agent interface plugging logic for tapA, but
  checking device existence failed because there is no such a device.
  here note even if check for device existence failed, this function
  won't raise exception and re-sync won't happen
  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531

  -- appeared tapA again due to rebuilding VM --
  -- next scan_device iteration start --

  6. Get current_devices 
  ->  current_devices is "tapA" 

  7. Get timestamp 
  -> current timestamp is {"tapA":2}. 

  8. Check device locally changed or not 
  -> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369

  9. Generate device_info like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set(),
                          'removed': set(),
                          'timestamps': {"tapA": 2}}

  next iteration is expected to detect device updated but didn't in this case.
  So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1781129/+subscriptions


Follow ups