yahoo-eng-team team mailing list archive

[Bug 2084446] [NEW] Scaling down neutron-openvswitch-agent doesn't remove its tunnel endpoints

 

Public bug reported:

When scaling down a node that runs a neutron L2 agent (I tested with
neutron-openvswitch-agent), the agent's tunnel endpoints are not cleaned
up from the db. For vxlan tunnels this is the entry in the
"ml2_vxlan_endpoints" table, but every tunnel type has its own table
there.
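
To make the stale-row condition concrete, here is a minimal sketch using
an in-memory SQLite database. It is not Neutron code. The column layout
of "ml2_vxlan_endpoints" (ip_address, udp_port, host) is a simplified
assumption, and the toy "agents" table, the helper names, the compute-2
host name and the 192.0.2.x addresses are hypothetical and only serve
the illustration.

```python
import sqlite3


def find_stale_endpoints(conn):
    """Return (host, ip_address) rows whose host has no registered agent."""
    query = """
        SELECT e.host, e.ip_address
        FROM ml2_vxlan_endpoints AS e
        LEFT JOIN agents AS a ON a.host = e.host
        WHERE a.host IS NULL
        ORDER BY e.host
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a toy database mimicking the state after an agent was deleted."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily simplified stand-in for the agents table.
    conn.execute("CREATE TABLE agents (host TEXT PRIMARY KEY)")
    # Assumed layout of the vxlan endpoints table.
    conn.execute(
        "CREATE TABLE ml2_vxlan_endpoints ("
        "ip_address TEXT PRIMARY KEY, udp_port INTEGER, host TEXT UNIQUE)"
    )
    # The agent on compute-1 is gone, so only compute-2 is registered.
    conn.execute("INSERT INTO agents VALUES ('devstack-ubuntu-compute-2')")
    # The endpoint row for compute-1 was never cleaned up.
    conn.executemany(
        "INSERT INTO ml2_vxlan_endpoints VALUES (?, ?, ?)",
        [
            ("192.0.2.11", 4789, "devstack-ubuntu-compute-1"),
            ("192.0.2.12", 4789, "devstack-ubuntu-compute-2"),
        ],
    )
    return conn


if __name__ == "__main__":
    # prints [('devstack-ubuntu-compute-1', '192.0.2.11')]
    print(find_stale_endpoints(build_demo_db()))
```

A query of this shape only reports candidates; whether a row is really
stale still has to be confirmed against the actual deployment.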

An additional issue is that even after I manually removed the endpoint
entry from that table, the other running agents still kept the tunnel to
that endpoint in their br-tun bridge, even after I restarted such an
agent. To show exactly what the issue is, here is what I did step by
step:

1. Deployed a multinode devstack with compute-1 and compute-2 nodes,
2. Tunnels in br-tun were created by the neutron-openvswitch-agent on both nodes,
3. I stopped the neutron-openvswitch-agent on the compute-1 node and then deleted it from the neutron db with the API command "openstack network agent delete <agent_id>",
4. On compute-2 the tunnel to compute-1 was still present in br-tun,
5. In the neutron db, the "ml2_vxlan_endpoints" table still contained the endpoint for compute-1,
6. I manually removed the endpoint from the "ml2_vxlan_endpoints" table in the db using the query: "DELETE FROM ml2_vxlan_endpoints WHERE host='devstack-ubuntu-compute-1';",
7. I restarted the neutron-openvswitch-agent on compute-2, but even after that the tunnel to compute-1 was still there,
8. To get rid of the stale tunnel to compute-1 on compute-2 I had to delete br-tun and then restart the neutron-openvswitch-agent on compute-2.
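
The check done by hand in steps 4 and 7 can be sketched as a comparison
between the ports present on br-tun and the endpoints still known to the
db. This is only an illustration: the port naming scheme used here
(tunnel type, a dash, then the remote IPv4 address as 8 hex digits) is
an assumption about how the agent names its tunnel ports, and the
function names and 192.0.2.x addresses are hypothetical.

```python
import ipaddress


def tunnel_port_name(network_type, remote_ip):
    """Build the assumed port name for an IPv4 tunnel, e.g. vxlan-c000020b."""
    addr = ipaddress.IPv4Address(remote_ip)
    return "%s-%08x" % (network_type, int(addr))


def find_stale_tunnel_ports(bridge_ports, endpoint_ips, network_type="vxlan"):
    """Return tunnel ports on the bridge that match no known endpoint IP."""
    expected = {tunnel_port_name(network_type, ip) for ip in endpoint_ips}
    prefix = network_type + "-"
    return sorted(
        port for port in bridge_ports
        # Ignore non-tunnel ports such as patch ports.
        if port.startswith(prefix) and port not in expected
    )


if __name__ == "__main__":
    # Ports as they might be listed on compute-2's br-tun (made-up values).
    ports = ["patch-int", "vxlan-c000020b", "vxlan-c000020d"]
    # Remote endpoint IPs still present in the db after step 6.
    endpoints = ["192.0.2.13"]
    # prints ['vxlan-c000020b']
    print(find_stale_tunnel_ports(ports, endpoints))
```

The port list could be collected with "ovs-vsctl list-ports br-tun".
Removing a single stale port with "ovs-vsctl del-port br-tun <port>"
might be less disruptive than deleting the whole bridge, but that was
not tested in this report.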

A tunnel that is not cleaned up is usually not a big issue, but in some
cases it may cause a serious problem. For example, when scaling down
networker nodes in a cluster that uses L3 HA, an old node may be removed
from the openstack cluster but for some reason still be up and running
in the datacenter. In that case keepalived processes for some HA routers
may still be running there, and because the old node still has
connectivity through the vxlan tunnels to the new networker nodes, the
active keepalived node may end up being this old one. As a result, the
router will be shown in the neutron API as 'standby' on all known L3
agents.
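
The failure mode above can be illustrated with a toy model. It is not
how keepalived implements VRRP: the election rule (a reachable master
keeps its role, otherwise the first reachable node takes over), the Node
class and the node names are all simplifying assumptions made for this
sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    known_to_neutron: bool  # does neutron still have an L3 agent for it?
    reachable: bool         # can it exchange VRRP traffic over the tunnels?
    is_master: bool = False


def elect_master(nodes):
    """Toy election: a reachable master stays master, else first reachable."""
    reachable = [n for n in nodes if n.reachable]
    for node in reachable:
        if node.is_master:
            return node.name
    return reachable[0].name if reachable else None


def api_view(nodes, master):
    """HA state per agent, limited to the agents neutron still knows about."""
    return {
        n.name: ("active" if n.name == master else "standby")
        for n in nodes
        if n.known_to_neutron
    }


if __name__ == "__main__":
    nodes = [
        # Removed from the cluster, but still running and still tunnelled.
        Node("old-networker", known_to_neutron=False, reachable=True,
             is_master=True),
        Node("new-networker-1", known_to_neutron=True, reachable=True),
        Node("new-networker-2", known_to_neutron=True, reachable=True),
    ]
    master = elect_master(nodes)
    # prints {'new-networker-1': 'standby', 'new-networker-2': 'standby'}
    print(api_view(nodes, master))
```

With the stale tunnel gone (reachable=False for the old node), the same
model elects one of the new networker nodes and the API view shows an
'active' instance again.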

** Affects: neutron
     Importance: Medium
         Status: Confirmed


** Tags: ovs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2084446
