yahoo-eng-team team mailing list archive
Message #95627
[Bug 2104979] [NEW] Possible deadlock on closing ip_monitor thread.
Public bug reported:
`ip_monitor` from `neutron/agent/linux/ip_lib.py` has a high chance of
deadlocking during shutdown.
The code assumes that an EOFError exception will be raised, which
introduces a race condition.
During code analysis I found that both implementations used by
ip_monitor, pyroute2.iproute.IPRoute and pyroute2.nslink.nslink.NetNS,
send `NetlinkError(104, 'Connection reset by peer')` back to the
receiving thread when close is called. I believe that, to handle
event_stop properly, the read_ip_updates thread should handle this error
instead of relying on the EOFError exception.
For testing I created a test network namespace and flapped the link state of the loopback interface (the error can also occur without any changes inside the namespace):
```
sudo ip netns create test
sudo ip netns exec test bash
while [ 1 ] ; do ip l s down lo ; sleep 1 ; ip l s up lo ; sleep 1; done
```
The failure can be reproduced using the neutron test script:
```
sudo timeout 3 python neutron/tests/functional/agent/linux/bin/ip_monitor.py temp_file test
```
So far I know that this affects neutron_l3_agent in ovs deployments
since 23.2.0 (we traced it to commit
https://github.com/openstack/neutron/commit/a7aeec703de2b1db2849da206fa349037ce23a0e).
I assume the change in pidfile management broke some cleanup process
that used to handle deadlocked daemons.
I verified in our lab that the neutron_l3_agent process leak can be fixed by a simple change like:
```
diff --git a/neutron/agent/linux/ip_lib.py b/neutron/agent/linux/ip_lib.py
index c9b42d83dd..1c5274b5b5 100644
--- a/neutron/agent/linux/ip_lib.py
+++ b/neutron/agent/linux/ip_lib.py
@@ -1534,6 +1534,8 @@ def ip_monitor(namespace, queue, event_stop, event_started):
             while True:
                 ip_addresses = _ip.get()
                 for ip_address in ip_addresses:
+                    if ip_address.get('error') == errno.ECONNRESET:
+                        return
                     LOG.debug("IP monitor %s; Adding IP address: %s "
                               "to the queue.", namespace, ip_address)
                     _queue.put(ip_address)
```
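The idea behind this patch can be illustrated outside neutron with a minimal, self-contained sketch. It does not use pyroute2: `FakeNetlinkSource`, `read_updates` and `run_demo` are hypothetical names, and the behaviour of `close()` (delivering a message whose `error` field is `errno.ECONNRESET`) is only an assumption modelled on what is described above. The reader thread returns as soon as it sees that marker, so it does not depend on an exception being raised.

```python
import errno
import queue
import threading


class FakeNetlinkSource:
    """Hypothetical stand-in for a netlink socket wrapper (not pyroute2)."""

    def __init__(self):
        self._pending = queue.Queue()

    def emit(self, message):
        # Deliver one batch containing a single message.
        self._pending.put([message])

    def get(self):
        # Blocks until a batch of messages is available, like a socket read.
        return self._pending.get()

    def close(self):
        # Assumption taken from the report: closing hands an ECONNRESET
        # error message to the blocked reader instead of raising EOFError.
        self._pending.put([{'error': errno.ECONNRESET}])


def read_updates(source, out_queue):
    """Forward messages to out_queue until the close marker is seen."""
    while True:
        for message in source.get():
            if message.get('error') == errno.ECONNRESET:
                return
            out_queue.put(message)


def run_demo():
    source = FakeNetlinkSource()
    out_queue = queue.Queue()
    reader = threading.Thread(
        target=read_updates, args=(source, out_queue), daemon=True)
    reader.start()

    source.emit({'event': 'RTM_NEWADDR', 'index': 1})
    source.close()
    # A bounded join, so a reader that never stops is detected, not awaited.
    reader.join(timeout=5)

    forwarded = []
    while not out_queue.empty():
        forwarded.append(out_queue.get())
    return reader.is_alive(), forwarded


if __name__ == '__main__':
    print(run_demo())
```

The bounded `join` is a deliberate choice in the sketch: if the reader missed the marker, the caller would see `is_alive()` return `True` rather than hang itself.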
As for the bug inside ip_monitor itself, I reproduced it on all neutron
versions between 16.0.0 and 25.1.0 using pyroute2==0.8.1.
I did not run tests on platforms other than Linux (Fedora 41 and Ubuntu
22.04.5 LTS (Jammy Jellyfish)).
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2104979
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2104979/+subscriptions