yahoo-eng-team team mailing list archive

[Bug 1681979] [NEW] L2pop flows are lost after OVS agent restart

 

Public bug reported:

In the OVS agent, there is a race condition between l2pop's
add_fdb_entries notification and provision_local_vlan, which creates the
vlanmanager mapping. As a result, the unicast entries, the flooding
entries, or both are not populated on the host. Without the flooding
entries, connectivity is lost.

The flows stay lost semi-permanently, because the l2pop mechanism driver
only sends the full list of fdb entries after a port_update_up for the
first port on an agent, or after an OVS reboot (where we either hit the
same race condition again, or the flows are only partially fixed).

Legacy testbed w/ 3 nodes. 4 tenant networks:

1. The add_fdb_entries code path creates the tunnel port(s) in
fdb_add_tun, then invokes add_fdb_flow to add the BC/UC l2pop flows, but
only if it can get a vlanmanager mapping:

    def fdb_add(self, context, fdb_entries):
        LOG.info("2. fdb_add received")
        for lvm, agent_ports in self.get_agent_ports(fdb_entries):
            agent_ports.pop(self.local_ip, None)
            LOG.info("2. fdb_add: agent_ports = %s", agent_ports)
            LOG.info("2. fdb_add: lvm = %s", lvm)
            if len(agent_ports):
                if not self.enable_distributed_routing:
                    with self.tun_br.deferred() as deferred_br:
                        LOG.info("2. fdb_add: about to call fdb_add_tun w/ lvm = %s", lvm)
                        self.fdb_add_tun(context, deferred_br, lvm,
                                         agent_ports, self._tunnel_port_lookup)
                else:
                    self.fdb_add_tun(context, self.tun_br, lvm,
                                     agent_ports, self._tunnel_port_lookup)

    def get_agent_ports(self, fdb_entries, local_vlan_map=None):
        """Generator to yield port info.

        For each known (i.e found in VLAN manager) network in
        fdb_entries, yield (lvm, fdb_entries[network_id]['ports']) pair.

        :param fdb_entries: l2pop fdb entries
        :param local_vlan_map: Deprecated.
        """
        lvm_getter = self._get_lvm_getter(local_vlan_map)
        for network_id, values in fdb_entries.items():
            try:
                lvm = lvm_getter(network_id, local_vlan_map)
            except vlanmanager.MappingNotFound:
                LOG.info("get_agent_ports: vlanmanager.MappingNotFound EXCEPTION! netid = %s, local_vlan_map = %s", network_id, local_vlan_map)
                continue
            agent_ports = values.get('ports')
            LOG.info("get_agent_ports: got lvm= %s", lvm)
            yield (lvm, agent_ports)


2. If the VLAN mapping isn't found, tunnel port creation is skipped, and so are the flows.
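
As a rough illustration of step 2, the sketch below partitions fdb entries
by whether a mapping exists. It is a toy stand-in, not Neutron code:
split_fdb_entries and mapped_networks are hypothetical names, and the
addresses are made up.

```python
# Toy stand-in for the lookup in get_agent_ports(); not Neutron code.
def split_fdb_entries(fdb_entries, mapped_networks):
    """Partition entries into (handled, dropped) by mapping presence."""
    handled, dropped = {}, {}
    for network_id, values in fdb_entries.items():
        if network_id in mapped_networks:
            handled[network_id] = values.get("ports")
        else:
            # Equivalent of the MappingNotFound `continue`: the entry is
            # skipped and nothing is queued for a later retry.
            dropped[network_id] = values.get("ports")
    return handled, dropped


fdb_entries = {
    "net-1": {"ports": {"10.0.0.2": ["flood"]}},
    "net-3": {"ports": {"10.0.0.3": ["flood"]}},
}
handled, dropped = split_fdb_entries(fdb_entries, mapped_networks={"net-3"})
print(sorted(handled), sorted(dropped))  # ['net-3'] ['net-1']
```

In the real agent the dropped half is simply never seen again unless the
full fdb list is resent.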

3. When we create the VLAN mapping in provision_local_vlan(), however,
install_flood_to_tun is skipped if no tunnel ports have been created
yet:

    def provision_local_vlan(self, net_uuid, network_type, physical_network,
                             segmentation_id):
...

        if network_type in constants.TUNNEL_NETWORK_TYPES:
            LOG.info("ARJUN: network_type = %s", network_type)
            if self.enable_tunneling:
                # outbound broadcast/multicast
                ofports = list(self.tun_br_ofports[network_type].values())
                LOG.info("ARJUN: provision_local_vlan: ofports = %s enable_tunneling = %s", ofports, self.enable_tunneling)
                if ofports:
                    LOG.info("ARJUN: installing FLOODING_ENTRY: lvid = %s segment_id = %s", lvid, segmentation_id)
                    self.tun_br.install_flood_to_tun(lvid,
                                                     segmentation_id,
                                                     ofports)
                # inbound from tunnels: set lvid in the right table
                # and resubmit to Table LEARN_FROM_TUN for mac learning

4. Finally, the cleanup stale flows logic removes all old flows. At this
point br-tun is left with missing flooding and/or unicast flows.


5. If #3 always happens first for all networks, we are good. Otherwise flows are lost:


Only unicast flows are missing (flood is added) if:

 - The network's vlanmanager mapping is allocated *after* its
add_fdb_entries, but some other network has already set up tunnel ports
on br-tun by then.

Broadcast AND UC flows are missing if:

 - A network tries to add fdb flows before its vlanmanager mapping is
allocated, and no other network has created the tunnel ports/ofports on
br-tun yet.
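
The two cases can be summarized as a small decision function. This is
only a sketch of the reasoning in this report; missing_flows and its two
parameters are hypothetical names, not anything in the agent.

```python
def missing_flows(mapping_exists_at_fdb_add, ofports_exist_at_provision):
    """Return which flow types a network ends up missing on br-tun."""
    if mapping_exists_at_fdb_add:
        # add_fdb_entries found the mapping and installed everything.
        return set()
    if ofports_exist_at_provision:
        # provision_local_vlan installed the flood flow, but the skipped
        # fdb entries never added the unicast flows.
        return {"unicast"}
    # Neither code path installed anything for this network.
    return {"flood", "unicast"}


print(sorted(missing_flows(False, True)))   # ['unicast']
print(sorted(missing_flows(False, False)))  # ['flood', 'unicast']
```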


Example with 3 tenant networks:

1. add_fdb_entries for networks 1 and 2 - no LVM yet, so flows and tunnel ports are not created
2. LVM created for network 2, but flood not installed because there are no ofports
3. LVM created for network 3
4. add_fdb_entries for network 3 - it finds the LVM and creates tunnel ports/flows
5. LVM created for network 1 - tunnel ofports now exist, so flood is installed, but unicast is missing

After this point, network 3 would be fine, network 2 would be missing
all flows, and network 1 would have flood but not unicast.
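
The ordering in this example can be replayed with a toy model of the two
code paths. This is a simplified sketch, not the agent's implementation:
ToyAgent and replay are hypothetical names, and the model assumes one
shared pool of tunnel ofports and ignores the stale flow cleanup.

```python
class ToyAgent:
    """Minimal model of the br-tun state touched by the two code paths."""

    def __init__(self):
        self.lvm = set()           # networks with a vlanmanager mapping
        self.tunnel_ports = False  # True once any tunnel ofport exists
        self.flood = set()         # networks with a flood flow
        self.unicast = set()       # networks with unicast flows

    def add_fdb_entries(self, net):
        # Like get_agent_ports(): networks without a mapping are skipped.
        if net not in self.lvm:
            return
        self.tunnel_ports = True
        self.flood.add(net)
        self.unicast.add(net)

    def provision_local_vlan(self, net):
        self.lvm.add(net)
        # Like the "if ofports:" guard: flood only if tunnel ports exist.
        if self.tunnel_ports:
            self.flood.add(net)


def replay(events):
    """Apply (method_name, network) events in order to a fresh ToyAgent."""
    agent = ToyAgent()
    for method_name, net in events:
        getattr(agent, method_name)(net)
    return agent


events = [
    ("add_fdb_entries", 1), ("add_fdb_entries", 2),
    ("provision_local_vlan", 2), ("provision_local_vlan", 3),
    ("add_fdb_entries", 3), ("provision_local_vlan", 1),
]
agent = replay(events)
print(sorted(agent.flood), sorted(agent.unicast))  # [1, 3] [3]
```

Reordering the events so that every provision_local_vlan comes first
leaves all three networks with both flow types in this model.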

The ordering seems to vary wildly depending on # of tunnel ports, # of
networks, ports per network, how ports are distributed, network speed,
etc...

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1681979

Title:
  L2pop flows are lost after OVS agent restart

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1681979/+subscriptions