yahoo-eng-team team mailing list archive

[Bug 1863110] [NEW] 2/3 snat namespace transitions to master

 

Public bug reported:

neutron version: 14.0.2
general deployment version: stein
deployment method: kolla-ansible
neutron configuration:
 - l3 = ha
 - agent_mode = dvr_snat
 - ovs
general info: multi-node deployment, roughly 100 compute nodes

when spawning larger heat stacks with multiple instances (think k8s
infrastructure), we get a "split brain" on the snat namespaces roughly
50% of the time: the router transitions to master on two of the three
controller/network nodes.

logs look like this on one of the three controller/network nodes:

11:53:43.402    Handling notification for router 2a218a31-2ef6-406a-a719-17965600e182, state master enqueue /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha.py:50
11:53:43.403    Router 2a218a31-2ef6-406a-a719-17965600e182 transitioned to master

and then this happens on another of the three controller/network nodes.

11:53:57.582	Handling notification for router 2a218a31-2ef6-406a-a719-17965600e182, state master enqueue /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/agent/l3/ha.py:50
11:53:57.583	Router 2a218a31-2ef6-406a-a719-17965600e182 transitioned to master

so neutron sets up all routes on both controller nodes, which wreaks havoc on the sessions that instances open to the outside. deleting the routes from the faulty namespace resolves the issue.
i can't find the reason for the second node being promoted to master, even when looking through the debug logs. i would greatly appreciate any helpful pointers.
the only explanation i can think of is some kind of race condition, which would also explain why everything in neutron looks fine.
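
To narrow down when the double promotion happens, it may help to scan the l3 agent logs from all three nodes for routers whose most recent transition is "master" on more than one node. Below is a minimal sketch in Python, assuming the log lines have the same shape as the excerpts above. The node names (`ctl1`, `ctl2`, `ctl3`) and the function name are hypothetical, and the script only reads log text; it does not query neutron or keepalived.

```python
import re
from collections import defaultdict

# Matches lines shaped like the excerpts above, e.g.
# "11:53:57.583  Router <uuid> transitioned to master"
TRANSITION_RE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"Router (?P<router>[0-9a-f-]{36}) transitioned to (?P<state>\w+)"
)


def find_multi_master(logs_by_node):
    """Return routers whose latest logged state is 'master' on 2+ nodes.

    logs_by_node maps a (hypothetical) node name to an iterable of log
    lines in chronological order. The result maps router id to a dict
    of {node: timestamp of its latest transition to master}.
    """
    latest = defaultdict(dict)  # router -> node -> (state, timestamp)
    for node, lines in logs_by_node.items():
        for line in lines:
            match = TRANSITION_RE.search(line)
            if match:
                # Later lines overwrite earlier ones, so only the most
                # recent transition per node is kept.
                latest[match.group("router")][node] = (
                    match.group("state"),
                    match.group("ts"),
                )

    result = {}
    for router, per_node in latest.items():
        masters = {
            node: ts
            for node, (state, ts) in per_node.items()
            if state == "master"
        }
        if len(masters) > 1:
            result[router] = masters
    return result


if __name__ == "__main__":
    router = "2a218a31-2ef6-406a-a719-17965600e182"
    sample = {
        "ctl1": ["11:53:43.403\tRouter %s transitioned to master" % router],
        "ctl2": ["11:53:57.583\tRouter %s transitioned to master" % router],
        "ctl3": ["11:53:40.100\tRouter %s transitioned to backup" % router],
    }
    for rid, nodes in find_multi_master(sample).items():
        print(rid, sorted(nodes.items()))
```

Comparing the timestamps it reports against the keepalived logs on the same nodes could show whether VRRP advertisements were missed in the window between the two promotions.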

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1863110

Title:
  2/3 snat namespace transitions to master

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1863110/+subscriptions