yahoo-eng-team team mailing list archive

[Bug 1818015] [NEW] VLAN manager removed external port mapping when it was still in use

 

Public bug reported:

A production Queens DVR deployment (12.0.3-0ubuntu1~cloud0) erroneously
cleaned up the VLAN/binding for an external network (used by multiple
ports, generally for routers) that was still in use. This occurred on
all hypervisors at around the same time.

2019-02-07 03:56:58.273 14197 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-71ccf801-d722-4196-a1d7-4924953939d8 - - - - -] Reclaiming vlan = 10 from net-id = fa2c3b23-5f25-4ab1-b06b-6edc405ec323
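
For anyone wanting to check their own agent logs for the same event, a minimal sketch of a filter for this message follows. It assumes each log record sits on a single line in the log file (the wrapping above comes from the email), and the names RECLAIM_RE and find_reclaims are made up for illustration; they are not part of neutron.

```python
import re

# Matches the "Reclaiming vlan" message in the form quoted in this report.
# Assumption: the record is on one line, starting with the timestamp.
RECLAIM_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"Reclaiming vlan = (?P<vlan>\d+) "
    r"from net-id = (?P<net_id>[0-9a-f-]{36})"
)


def find_reclaims(lines, net_id=None):
    """Return (timestamp, vlan, net_id) for each reclaim message in lines.

    If net_id is given, only messages for that network are returned.
    """
    hits = []
    for line in lines:
        m = RECLAIM_RE.search(line)
        if m and (net_id is None or m.group("net_id") == net_id):
            hits.append((m.group("ts"), int(m.group("vlan")), m.group("net_id")))
    return hits


# The log record from this report, joined back onto one line.
SAMPLE = (
    "2019-02-07 03:56:58.273 14197 INFO "
    "neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent "
    "[req-71ccf801-d722-4196-a1d7-4924953939d8 - - - - -] "
    "Reclaiming vlan = 10 from net-id = fa2c3b23-5f25-4ab1-b06b-6edc405ec323"
)

hits = find_reclaims([SAMPLE], net_id="fa2c3b23-5f25-4ab1-b06b-6edc405ec323")
```

Running the same filter over the agent log from each hypervisor and comparing the timestamps would show how closely the reclaims line up across hosts.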

This broke traffic flow for the remaining router using this port. After
neutron-openvswitch-agent was restarted, it reported that the port was
updated and then re-added the mapping, and traffic flowed again.

Unfortunately I don't have good details on what caused this, and I do
not have a reproduction case. My hope is to analyse, in theory, what
may have led to it.

This is a reasonably sized cloud with 10 compute hosts, hundreds of
instances, and 56 routers.

A few details that I do have:
 - It seems that multiple neutron ports were being deleted across the cloud at the time. The one event I can see in the hypervisor's auth.log is that a floating IP on that same network was removed within the preceding minute. I am not sure whether that was related. Unfortunately I do not have the corresponding neutron-api logs from that period.

My hope is to analyse how the VLAN manager could, in theory, lose track
of multiple users of the port, and do so in a way that happens
consistently across all hypervisors.
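
To make the question concrete, here is a toy model of the kind of bookkeeping I have in mind. It is only a sketch and is not the neutron agent code: the class LocalVlanTracker, its methods, and the port names are all hypothetical. It assumes the mapping is reclaimed as soon as the set of known ports on a network becomes empty, and shows how a port that was never recorded would not prevent that.

```python
class LocalVlanTracker:
    """Toy model: one local VLAN per network, reclaimed when no known ports remain.

    Hypothetical illustration only; this is not how neutron is implemented.
    """

    def __init__(self):
        self.ports_by_net = {}  # net_id -> set of port ids recorded on this host
        self.vlan_by_net = {}   # net_id -> local VLAN tag
        self._next_vlan = 1

    def port_bound(self, net_id, port_id):
        """Record a port as a user of the network, allocating a VLAN if needed."""
        if net_id not in self.vlan_by_net:
            self.vlan_by_net[net_id] = self._next_vlan
            self._next_vlan += 1
        self.ports_by_net.setdefault(net_id, set()).add(port_id)
        return self.vlan_by_net[net_id]

    def port_unbound(self, net_id, port_id):
        """Drop a port; return the reclaimed VLAN if it was the last known user."""
        ports = self.ports_by_net.get(net_id, set())
        ports.discard(port_id)
        if not ports and net_id in self.vlan_by_net:
            # No recorded users remain, so the mapping is reclaimed, even if
            # something that was never recorded is still using the network.
            self.ports_by_net.pop(net_id, None)
            return self.vlan_by_net.pop(net_id)
        return None


tracker = LocalVlanTracker()
tracker.port_bound("ext-net", "port-a")
# Hypothetical: "port-b" also uses ext-net but was never recorded here.
reclaimed = tracker.port_unbound("ext-net", "port-a")
still_mapped = "ext-net" in tracker.vlan_by_net
```

In this model, removing the only recorded port reclaims the mapping while the unrecorded user is still active. Whether anything like this can actually happen in the agent is exactly what I would like to find out.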

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1818015


