yahoo-eng-team team mailing list archive
Message #88674
[Bug 1969270] [NEW] neutron-dhcp-agent memory leak on network sync failure
Public bug reported:
neutron version: 15.0.2 (still present in the latest release)
I've found a very interesting memory leak issue in neutron-dhcp-agent:
When dhcp-agent tries to sync network state, it makes an RPC call to
neutron-server. If something goes wrong on neutron-server's side
(a database access failure, for example), an error is returned to
dhcp-agent and deserialized into a RemoteError object.
The RemoteError is added to
neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic
resync. The following code in the method
neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper handles
network resync:
    if self.needs_resync_reasons:
        # be careful to avoid a race with additions to list
        # from other threads
        reasons = self.needs_resync_reasons
        self.needs_resync_reasons = collections.defaultdict(list)
        for net, r in reasons.items():
            if not net:
                net = "*"
            LOG.debug("resync (%(network)s): %(reason)s",
                      {"reason": r, "network": net})
        self.sync_state(reasons.keys())
There's a trap here: "reasons" is a defaultdict, and in Python 3
"reasons.keys()" returns a view object rather than a copy. The view
still holds a reference to "reasons", so the self.sync_state method
frame holds an indirect reference to the previous RemoteError object.
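The point about the keys view can be checked in isolation. The following
is a minimal sketch, not neutron code: FakeReason and reason_survives are
hypothetical names, and FakeReason merely stands in for a RemoteError so
that a weak reference can observe whether it is still alive.

```python
import collections
import gc
import weakref


class FakeReason:
    """Hypothetical stand-in for a RemoteError kept as a resync reason."""


def reason_survives(snapshot_keys):
    """Return True if the reason is still alive once only the keys remain."""
    reasons = collections.defaultdict(list)
    reason = FakeReason()
    reasons["net-1"].append(reason)
    # A weak reference observes the object without keeping it alive.
    probe = weakref.ref(reason)
    del reason

    # Either copy the keys into a list or keep the live view.
    keys = list(reasons.keys()) if snapshot_keys else reasons.keys()
    del reasons  # drop our own name for the dict
    gc.collect()
    # "keys" is still in scope here; only the view keeps the dict reachable.
    return probe() is not None
```

Under these assumptions, reason_survives(False) should report that the
reason is still alive, because the view keeps the dict reachable, while
reason_survives(True) should report that it has been freed.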
When self.sync_state is invoked, another RemoteError is raised because
neutron-server is still malfunctioning. The new RemoteError object's
traceback references the sync_state frame, which still holds a
reference to the previous RemoteError. As a result, earlier RemoteError
objects are never garbage collected.
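The chain described above can be imitated without neutron or
oslo.messaging. This is a rough sketch under stated assumptions:
FakeRemoteError, FakeAgent and live_errors_after are hypothetical names,
the loop only mimics the shape of _periodic_resync_helper, and every
sync attempt is made to fail.

```python
import collections
import gc
import weakref


class FakeRemoteError(Exception):
    """Hypothetical stand-in for the deserialized RemoteError."""


class FakeAgent:
    """Toy agent that only mimics the resync bookkeeping."""

    def __init__(self):
        self.needs_resync_reasons = collections.defaultdict(list)
        self.probes = []  # weak references to every error ever raised

    def schedule_resync(self, reason, network_id=None):
        self.needs_resync_reasons[network_id].append(reason)

    def sync_state(self, networks=None):
        # The server is assumed to be down, so every sync fails.
        try:
            raise FakeRemoteError("database access failure")
        except FakeRemoteError as e:
            # The traceback of "e" references this frame, and this frame
            # keeps "networks" alive for as long as the error lives.
            self.probes.append(weakref.ref(e))
            self.schedule_resync(e)


def live_errors_after(rounds, snapshot_keys):
    """Run failing resync rounds and count errors that are still alive."""
    agent = FakeAgent()
    agent.sync_state()  # the first failure seeds needs_resync_reasons
    for _ in range(rounds):
        reasons = agent.needs_resync_reasons
        agent.needs_resync_reasons = collections.defaultdict(list)
        if snapshot_keys:
            agent.sync_state(list(reasons.keys()))
        else:
            agent.sync_state(reasons.keys())
    gc.collect()
    return sum(1 for probe in agent.probes if probe() is not None)
```

With the keys view, each new error should keep the previous one alive
through its traceback, so the count is expected to grow by one per
round. With the list copy, only the most recent errors should remain;
the exact small number depends on which frames are still live at the
time of counting.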
I've generated a reference graph using objgraph, which helps to
understand the reference chain. Please see the attachment.
One proposed fix is to change self.sync_state(reasons.keys()) to
self.sync_state(list(reasons.keys())) in
DhcpAgent._periodic_resync_helper.
Another option is to add str(reason) to self.needs_resync_reasons
instead of the reason object itself, in DhcpAgent.schedule_resync.
Both of them break the reference chain.
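The second option can be sketched as follows. This is an illustration
only: FakeRemoteError and failing_sync are hypothetical, and this
schedule_resync is a free function that takes the reasons dict as an
argument, which is not the signature of the real method.

```python
import collections
import gc
import weakref


class FakeRemoteError(Exception):
    """Hypothetical stand-in for the deserialized RemoteError."""


def schedule_resync(needs_resync_reasons, reason, network_id=None):
    # Store only the text of the reason. A string carries no traceback,
    # so it cannot keep any frame or earlier error alive.
    needs_resync_reasons[network_id].append(str(reason))


def failing_sync(needs_resync_reasons):
    """Simulate one failed sync and return a weak reference to its error."""
    try:
        raise FakeRemoteError("database access failure")
    except FakeRemoteError as e:
        probe = weakref.ref(e)
        schedule_resync(needs_resync_reasons, e)
    return probe
```

After failing_sync returns, only the message text should remain in the
dict and the exception object itself should be collectable. The
trade-off is that the exception type and traceback are no longer
available when the reasons are logged later.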
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969270
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1969270/+subscriptions