yahoo-eng-team team mailing list archive
Message #88697
[Bug 1969270] Re: neutron-dhcp-agent memory leak on network sync failure
Reviewed: https://review.opendev.org/c/openstack/neutron/+/838492
Committed: https://opendev.org/openstack/neutron/commit/e3b3ec930967305e5fce314c0a4cf74151ad711c
Submitter: "Zuul (22348)"
Branch: master
commit e3b3ec930967305e5fce314c0a4cf74151ad711c
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date: Wed Apr 13 23:38:04 2022 +0000
[DHCP] Break reference chain to any Exception object when resync
In the DHCP agent, if an exception is raised during the driver call,
"DhcpAgent.schedule_resync" is called. Before this patch, the
exception instance was passed instead of a string. This instance
reference was stored in the dictionary "needs_resync_reasons" and
used in "_periodic_resync_helper" to resync the DHCP agent
information.
The call to "sync_state" passed the result of the dictionary's ".keys()"
method. In Python 2.7, when that code was written, this method created a
list of the dictionary keys. In Python 3, it returns a view object that
holds a reference to the dictionary itself.
This patch breaks this reference chain in two points (actually only
one is needed):
- The call to "sync_state" now passes a list built from the dictionary keys.
- The dictionary "needs_resync_reasons" now stores the exception
strings only, instead of the exception instance.
Closes-Bug: #1969270
Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969270
Title:
neutron-dhcp-agent memory leak on network sync failure
Status in neutron:
Fix Released
Bug description:
neutron version: 15.0.2 (still present in the latest release)
I've found a very interesting memory leak issue in neutron-dhcp-agent:
When dhcp-agent tries to sync network state, it makes an RPC call to
neutron-server. If something goes wrong on neutron-server's side (a
database access failure, for example), an error is returned to
dhcp-agent and deserialized into a RemoteError object.
The RemoteError will be added to
neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic
resync. The following code in the method
neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper() handles
network resync:
    if self.needs_resync_reasons:
        # be careful to avoid a race with additions to list
        # from other threads
        reasons = self.needs_resync_reasons
        self.needs_resync_reasons = collections.defaultdict(list)
        for net, r in reasons.items():
            if not net:
                net = "*"
            LOG.debug("resync (%(network)s): %(reason)s",
                      {"reason": r, "network": net})
        self.sync_state(reasons.keys())
There's a trap here: in Python 3, "reasons.keys()" returns a view that
holds a reference to the "reasons" dictionary, so the self.sync_state
method frame holds an indirect reference to the previous RemoteError
object.
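The difference between a keys view and a list copy can be checked in isolation. This is only a minimal sketch, not the agent code: FakeRemoteError and error_survives are hypothetical names, and the check relies on CPython's reference counting.

```python
import weakref


class FakeRemoteError(Exception):
    """Hypothetical stand-in for the RemoteError (assumption)."""


def error_survives(snapshot):
    """Return True if the stored error outlives the 'reasons' name."""
    reasons = {"net-1": [FakeRemoteError("db down")]}
    probe = weakref.ref(reasons["net-1"][0])
    # 'networks' plays the role of the argument held by the sync_state frame.
    networks = snapshot(reasons)
    del reasons
    # If the snapshot still references the dict, the error is still reachable.
    return probe() is not None


view_keeps_error = error_survives(lambda d: d.keys())
list_keeps_error = error_survives(lambda d: list(d.keys()))
```

With the view, the dictionary and the error inside it stay reachable through "networks". With the list copy, only the key strings are kept.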
When this self.sync_state is invoked, another RemoteError will be
raised, since neutron-server is still malfunctioning. The new
RemoteError object's traceback has a reference to the sync_state frame,
which still holds a reference to the previous RemoteError. So the
earlier RemoteError objects will never be garbage collected.
I've generated a reference graph using objgraph, which helps to
understand the reference chain. Please see the attachment.
One proposed fix is to modify self.sync_state(reasons.keys()) to
self.sync_state(list(reasons.keys())) in
DhcpAgent._periodic_resync_helper()
Another way is to add str(reason) to self.needs_resync_reasons instead
of the reason object itself, in DhcpAgent.schedule_resync().
Either of them breaks the reference chain.
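The chain and both proposed fixes can be reproduced outside neutron with a small simulation. This is a rough sketch under assumptions: FakeRemoteError, surviving_errors and keep are hypothetical names, the dictionary stores one reason per network rather than a list, and the counts rely on CPython's reference counting.

```python
import weakref


class FakeRemoteError(Exception):
    """Hypothetical stand-in for the RemoteError (assumption)."""


def sync_state(networks):
    # 'networks' stays in this frame's locals, and the traceback of the
    # exception raised here keeps the frame (and so 'networks') alive.
    raise FakeRemoteError("server still failing")


def surviving_errors(rounds, snapshot, as_reason):
    """Run failing resync rounds and count errors that are still alive."""
    reasons = {}
    probes = []
    for _ in range(rounds):
        # Same swap as in _periodic_resync_helper.
        pending, reasons = reasons, {}
        try:
            sync_state(snapshot(pending))
        except FakeRemoteError as exc:
            probes.append(weakref.ref(exc))
            # Simplified schedule_resync: remember why a resync is needed.
            reasons["net-1"] = as_reason(exc)
    return sum(1 for probe in probes if probe() is not None)


def keep(exc):
    """Old behaviour: store the exception instance itself."""
    return exc


leaky = surviving_errors(5, lambda d: d.keys(), keep)
list_fix = surviving_errors(5, lambda d: list(d.keys()), keep)
str_fix = surviving_errors(5, lambda d: d.keys(), str)
```

In the leaky variant every error from every round stays alive, because each traceback reaches the previous dictionary through the keys view. With either fix the count stays small no matter how many rounds are run.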
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1969270/+subscriptions