← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1781129] Re: linuxbridge-agent missed updated device sometimes

 

Reviewed:  https://review.openstack.org/581648
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Submitter: Zuul
Branch:    master

commit b803195a9979f7b3b0fd9cea41699b33fc8cf2bb
Author: Yuki Nishiwaki <uckey.1067@xxxxxxxxx>
Date:   Wed Jul 25 16:05:30 2018 +0900

    Dont use dict.get() to know certain key is in dict
    
    In CommonAgentLoop class, there is logic to detect tap device is changed
    locally or not by comparing timestamp with previous.
    Sometimes timestamp value could be None depending on the timing (see bug/1781129)
    
    But current _get_devices_locally_modified logic can not detect local
    change from None to something because _get_devices_locally_modified
    function don't always compare if previous timestamp value was None.
    
    In order not to miss updated device always, better not to use dict.get() to
    know previous iteration have timestamp or not.
    
    Change-Id: Ib0361ad5c281f88558e8e048cfec588b9f9b1de4
    Closes-Bug: #1781129


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1781129

Title:
  linuxbridge-agent missed updated device sometimes

Status in neutron:
  Fix Released

Bug description:
  * Version: 
  master branch head as of 11 July 2018
  https://github.com/openstack/neutron/commit/5db397958160ea3dc952794fb7f6ec68a2da2055

  * Summary:
  When the operation that make tap interface disappeared/appeared in short interval executed like rebuilding VM,
  linuxbridge-agent can miss updated device events depending on when tap device disappeared. this cause eventually following "VirtualInterfaceCreateException" error in nova-compute because neutron-server didn't send vif_plugged event to Nova.  

  ---
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4946, in _create_domain_and_network
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]     raise exception.VirtualInterfaceCreateException()
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa] VirtualInterfaceCreateException: Virtual Interface creation failed
  2018-07-11 01:07:49.632 56453 ERROR nova.compute.manager [instance: a58186f4-3b2e-46ba-acfe-d24432d117aa]
  ---

  * Reproducing:
  Actually this is very difficult to reproduce because the pre-condition to make it reproduce strongly depending on the running state in linuxbridge-agent, so I'm gonna explain the state transition for the logic of detection to updated device step by step

  ---
  let's say Hypervisor have 1 tap device for 1 VM which is "tapA" and tapA's interface index is 1 and User just requested rebuilding this VM.  

  0. Previous device info is like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set(),
                          'removed': set(),
                          'timestamps': {"tapA": 1}}

  1. Get current_devices https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/agent/_common_agent.py#L377
  ->  current_devices is "tapA" 

  __ disappeared tapA due to rebuilding VM __

  2. Get timestamp(interface index in the case of linuxbridge-agent)  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L395
  ->  current timestamp is {"tapA": None}. this is because we failed to get interface information here

  3. Check device locally changed or not
  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L397
  ->  locally "tapA" is detected  as locally changed device. because timestamp information is change from before (1 != None)

  4. Generate device_info like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set("tapA"),
                          'removed': set(),
                          'timestamps': {"tapA": None}}

  5. Process linuxbridge-agent interface plugging logic for tapA, but
  checking device existence failed because there is no such a device.
  here note even if check for device existence failed, this function
  won't raise exception and re-sync won't happen
  https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py#L521-L531

  -- appeared tapA again due to rebuilding VM --
  -- next scan_device iteration start --

  6. Get current_devices 
  ->  current_devices is "tapA" 

  7. Get timestamp 
  -> current timestamp is {"tapA":2}. 

  8. Check device locally changed or not 
  -> no locally device is detected because of this line https://github.com/openstack/neutron/blob/5db397958160ea3dc952794fb7f6ec68a2da2055/neutron/plugins/ml2/drivers/agent/_common_agent.py#L369

  9. Generate device_info like following
                         {'added': set(),
                          'current': set("tapA"),
                          'updated': set(),
                          'removed': set(),
                          'timestamps': {"tapA": 2}}

  next iteration is expected to detect device updated but didn't in this case.
  So we have to improve this locally changed device detection logic. otherwise rebooting/rebuilding operation would fail sometimes(it's really rare case though)

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1781129/+subscriptions


References