yahoo-eng-team team mailing list archive

[Bug 1499488] [NEW] Race condition puts ovs agent in resync

 

Public bug reported:

The following code is from
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent.treat_devices_added_or_updated():

        devices_details_list = (
            self.plugin_rpc.get_devices_details_list_and_failed_devices(
                self.context,
                devices,
                self.agent_id,
                self.conf.host))
        if devices_details_list.get('failed_devices'):
            #TODO(rossella_s) handle better the resync in next patches,
            # this is just to preserve the current behavior
            raise DeviceListRetrievalError(devices=devices)

        devices = devices_details_list.get('devices')
        vif_by_id = self.int_br.get_vifs_by_ids(
            [vif['device'] for vif in devices])

The race window lies between the calls to
get_devices_details_list_and_failed_devices() and get_vifs_by_ids().  If
a VM is deleted in that window, the OVS port goes away and
get_vifs_by_ids() raises an exception, which bumps us out to the
exception handler in rpc_loop and puts us in resync, causing the next
rpc_loop iteration to rescan ALL ports.  On a highly scaled system, this
resync can take many minutes, during which all new plug requests time
out.
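
To make the cost of that sequence concrete, here is a toy model of it.
This is not the agent's code: PortNotFound, lookup() and
loop_iteration() are hypothetical names made up for illustration, and
the dict stands in for the ports present on the bridge.

```python
class PortNotFound(Exception):
    """Hypothetical stand-in for the error raised on a missing OVS port."""


def lookup(ovs_ports, name):
    # Strict lookup: a port that has vanished is treated as an error.
    if name not in ovs_ports:
        raise PortNotFound(name)
    return ovs_ports[name]


def loop_iteration(ovs_ports, updated, sync):
    """Simplified loop iteration: returns (ports_processed, sync_needed)."""
    # In resync mode every port on the bridge is rescanned.
    to_process = sorted(ovs_ports) if sync else list(updated)
    try:
        for name in to_process:
            lookup(ovs_ports, name)
    except PortNotFound:
        # Any failure requests a full resync on the next iteration.
        return len(to_process), True
    return len(to_process), False


ovs_ports = {"port%d" % i: i for i in range(1000)}
updated = ["port7"]      # the server has returned details for port7
del ovs_ports["port7"]   # the VM is deleted before the local lookup

first = loop_iteration(ovs_ports, updated, sync=False)
second = loop_iteration(ovs_ports, [], sync=first[1])
print(first, second)
```

In this model a single vanished port turns a one-port update into a
rescan of every remaining port on the next iteration.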

get_vifs_by_ids() was added under this patch:
https://review.openstack.org/#/c/186734/

The exception is raised for the missing port because the new
get_vifs_by_ids method does not pass if_exists=True on its call to
get_ports_attributes().  A grep within that file shows every other
call to get_ports_attributes passing if_exists=True.

I believe the fix is simply to start passing if_exists=True in
get_vifs_by_ids().
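
A minimal sketch of the intended behavior, using a fake bridge rather
than the real one.  FakeBridge and its simplified method signatures are
assumptions for illustration only; the real get_ports_attributes() takes
different arguments.  The point is only that with if_exists=True a
missing port yields no row instead of an exception, so the caller sees
None for that id.

```python
class FakeBridge:
    """Hypothetical stand-in for the OVS integration bridge."""

    def __init__(self, vifs):
        self._vifs = dict(vifs)  # iface-id -> port name

    def get_ports_attributes(self, port_ids, if_exists=False):
        # Simplified signature (assumption): returns one row per port found.
        rows = []
        for port_id in port_ids:
            if port_id in self._vifs:
                rows.append({"iface-id": port_id,
                             "name": self._vifs[port_id]})
            elif not if_exists:
                # Strict mode: a missing port is an error.
                raise RuntimeError("no row for port %s" % port_id)
        return rows

    def get_vifs_by_ids(self, port_ids, if_exists=True):
        # Tolerate ports that disappeared between the RPC and this lookup.
        rows = self.get_ports_attributes(port_ids, if_exists=if_exists)
        by_id = {row["iface-id"]: row for row in rows}
        return {port_id: by_id.get(port_id) for port_id in port_ids}


bridge = FakeBridge({"vif-a": "tap-a"})
result = bridge.get_vifs_by_ids(["vif-a", "vif-gone"])
print(result)
```

The caller can then skip entries whose value is None rather than
falling into the resync path.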

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1499488

Title:
  Race condition puts ovs agent in resync

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1499488/+subscriptions

