
yahoo-eng-team team mailing list archive

[Bug 1979089] [NEW] l3ha router delete race condition if state changes to master

 

Public bug reported:

We have hit an issue where nodes running Neutron Ussuri with ML2 L3HA
occasionally run into the following:

On a 3-node neutron-gateway with approximately 280 HA routers, the l3-agent
sometimes gets into a state where it repeatedly tries to spawn a
metadata proxy (haproxy) for a router that no longer exists and fails
because the namespace is no longer there. This happens thousands of
times a day and effectively blocks the l3-agent from processing other
updates. The error looks like:

2022-05-21 06:26:12.882 30127 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/57837a95-ed3b-4a1b-9393-1374a8c744c3.conf'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
2022-05-21 06:26:13.116 30127 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3": No such file or directory

Some background: when the l3-agent starts, it subscribes callback methods
to certain events. One of those events is "before_delete" [1], and the
subscribed method is neutron.agent.metadata.driver.before_router_removed.
The idea is that when an update to delete a router is received, this
method runs first and removes the "neutron metadata-proxy monitor",
which watches the haproxy pid and respawns the proxy if it dies.

A successful callback execution looks like:

2022-05-21 03:05:54.676 30127 DEBUG neutron_lib.callbacks.manager [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-21 03:05:54.676 30127 DEBUG neutron.agent.linux.utils [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-15', '26363'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88

And an unsuccessful one looks like:

2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 57837a95-ed3b-4a1b-9393-1374a8c744c3 transitioned to master on agent sgdemr0114bp007
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_lib.callbacks.manager [req-6bfaa057-0ab9-450c-b27f-d4008fd7f9f1 - a87539ab4d2e4423b28ae6634e0d9c25 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 57837a95-ed3b-4a1b-9393-1374a8c744c3 _update_metadata_proxy /usr/lib/python3/dist-packages/neutron/agent/l3/ha.py:219

The difference is that instead of killing the proxy monitor it is
actually spawning a new one! The other thing to notice is that in the
second case it isn't servicing the same request (the log context is
"[-]"). Looking at the code, this happens because when the router
transitions to master the l3-agent puts the thread to sleep with
eventlet.sleep(2) [2] before proceeding with the update, i.e. spawning
the metadata proxy. While that thread was asleep the agent started
processing the before_delete callback, but the callback was pre-empted
before it could kill the proxy monitor, and the transition thread then
went ahead and spawned the metadata proxy for the router being deleted.

So this looks like a simple race condition that occurs when a router
transitions to master at the same time as it is being deleted.
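
The interleaving can be reproduced with a toy eventlet program; the
names, timings and flag below are invented for the illustration and are
not Neutron code:

# Toy demonstration of the race window opened by eventlet.sleep(): one
# greenthread handles the master transition (sleeps, then spawns the
# proxy), another handles the router delete.
import eventlet

namespace_exists = True


def handle_master_transition():
    # Mirrors the idea in neutron/agent/l3/ha.py: sleep before acting on
    # the new state, which yields control to other greenthreads.
    eventlet.sleep(2)
    if namespace_exists:
        print('spawning metadata proxy')
    else:
        print('spawning metadata proxy -> fails, namespace is gone')


def handle_router_delete():
    # Stand-in for the before_delete handling: the namespace and proxy
    # monitor are torn down while the other greenthread is still asleep.
    global namespace_exists
    namespace_exists = False
    print('metadata-proxy monitor killed, namespace removed')


transition = eventlet.spawn(handle_master_transition)
eventlet.sleep(1)   # the delete arrives while the transition thread sleeps
eventlet.spawn(handle_router_delete)
transition.wait()   # wakes after 2s and acts on a router that is gone

Whichever side wins the race, the net effect is a metadata proxy being
spawned for a router whose namespace is about to disappear, or already
has.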

A simple interim workaround is to manually create the missing namespace,
which allows the respawn to succeed; the callback should then eventually
get to run and delete it again to clean up. Alternatively, restart the
neutron-l3-agent service.
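
For example, using the router UUID from the logs above, the workaround
amounts to recreating the namespace (a rough Python sketch; substitute
your own router UUID):

# Recreate the missing qrouter namespace so the stuck respawn can succeed
# and the cleanup callback can eventually remove it again.
import subprocess

router_id = '57837a95-ed3b-4a1b-9393-1374a8c744c3'  # example from the logs
subprocess.run(['sudo', 'ip', 'netns', 'add', 'qrouter-%s' % router_id],
               check=True)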

[1] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/metadata/driver.py#L186
[2] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1979089

