[Bug 1979089] [NEW] l3ha router delete race condition if state changes to master
Public bug reported:
We have hit an issue where nodes running Neutron Ussuri with ML2 and L3HA
occasionally run into the following:
On a 3-node neutron-gateway with approximately 280 HA routers, the l3-agent
sometimes gets into a state where it repeatedly tries to spawn a
metadata proxy (haproxy) for a router that no longer exists, and fails
because the namespace is no longer there. This happens thousands of
times a day and effectively blocks the l3-agent from processing other
updates. The error looks like:
2022-05-21 06:26:12.882 30127 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/57837a95-ed3b-4a1b-9393-1374a8c744c3.conf'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
2022-05-21 06:26:13.116 30127 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3": No such file or directory
Some background: when the l3-agent starts, it subscribes callback methods
to certain events. One of those events is "before_delete" [1], for which
the callback is neutron.agent.metadata.driver.before_router_removed. The
idea is that when an update to delete a router is received, this method
runs first and removes the metadata-proxy monitor, i.e. the process
monitor that watches the haproxy pid and respawns haproxy if it dies.
A successful callback execution looks like:
2022-05-21 03:05:54.676 30127 DEBUG neutron_lib.callbacks.manager [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-21 03:05:54.676 30127 DEBUG neutron.agent.linux.utils [req-4eba076a-fcd8-41d5-bfd0-4d0af62aca40 - dd773b0f26da469d85d2a825fa794863 - - -] Running command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-15', '26363'] create_process /usr/lib/python3/dist-packages/neutron/agent/linux/utils.py:88
And an unsuccessful one looks like:
2022-05-10 23:36:10.480 30127 INFO neutron.agent.l3.ha [-] Router 57837a95-ed3b-4a1b-9393-1374a8c744c3 transitioned to master on agent sgdemr0114bp007
...
2022-05-10 23:36:10.646 30127 DEBUG neutron_lib.callbacks.manager [req-6bfaa057-0ab9-450c-b27f-d4008fd7f9f1 - a87539ab4d2e4423b28ae6634e0d9c25 - - -] Notify callbacks ['neutron.agent.metadata.driver.before_router_removed--9223363255257968124'] for router, before_delete _notify_loop /usr/lib/python3/dist-packages/neutron_lib/callbacks/manager.py:193
2022-05-10 23:36:10.853 30127 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 57837a95-ed3b-4a1b-9393-1374a8c744c3 _update_metadata_proxy /usr/lib/python3/dist-packages/neutron/agent/l3/ha.py:219
The difference is that instead of killing the proxy monitor, it actually
spawns a new one! The other thing to notice is that in the second case it
is not servicing the same request (the context is "[-]"). Looking at the
code, this happens because when the router transitions to master the
l3-agent puts the thread into eventlet.sleep(2) [2] before proceeding
with the update, i.e. spawning the metadata proxy. While that thread is
asleep the agent starts processing the before_delete callback, but the
callback gets pre-empted, and once the sleep ends the metadata proxy is
spawned for a router that is in the middle of being deleted.
So this looks like a simple race condition that occurs when a router
transitions to master at the same time as it is being deleted.
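The interleaving can be reproduced with a toy eventlet example
(illustrative only, not neutron code; the function names are made up):

    import eventlet

    def on_transition_to_master():
        print("router transitioned to master")
        eventlet.sleep(2)  # the l3-agent yields here, as in ha.py [2]
        # ...and resumes here after before_delete has already run
        print("spawning metadata proxy")

    def on_before_delete():
        print("before_delete: killing metadata proxy monitor")

    t1 = eventlet.spawn(on_transition_to_master)
    eventlet.sleep(1)  # router delete arrives while t1 is sleeping
    t2 = eventlet.spawn(on_before_delete)
    t1.wait()
    t2.wait()

Running this prints the "killing" line before the "spawning" line, i.e.
the proxy ends up spawned after the delete callback has already done its
cleanup, which is exactly the ordering seen in the logs above.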
A simple interim workaround is to manually create the missing namespace,
which allows the respawn to succeed; the before_delete callback should
then eventually get to run and clean everything up again. Alternatively,
restart the neutron-l3-agent service.
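For example, using the router ID from the logs above (substitute the ID
of the stuck router on your node):

    # recreate the missing namespace so the respawn stops failing
    sudo ip netns add qrouter-57837a95-ed3b-4a1b-9393-1374a8c744c3

    # or simply restart the agent
    sudo systemctl restart neutron-l3-agent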
[1] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/metadata/driver.py#L186
[2] https://github.com/openstack/neutron/blob/52bb040e4e21b9db7e9787cec8ac86de5644eadb/neutron/agent/l3/ha.py#L149
** Affects: neutron
Importance: Undecided
Status: New