yahoo-eng-team team mailing list archive

[Bug 1969270] Re: neutron-dhcp-agent memory leak on network sync failure

 

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/838492
Committed: https://opendev.org/openstack/neutron/commit/e3b3ec930967305e5fce314c0a4cf74151ad711c
Submitter: "Zuul (22348)"
Branch:    master

commit e3b3ec930967305e5fce314c0a4cf74151ad711c
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date:   Wed Apr 13 23:38:04 2022 +0000

    [DHCP] Break reference chain to any Exception object when resync
    
    In the DHCP agent, if an exception is raised during the driver call,
    "DhcpAgent.schedule_resync" is called. Before this patch, the
    exception instance was passed instead of a string. This instance
    reference was stored in the dictionary "needs_resync_reasons" and
    used in "_periodic_resync_helper" to resync the DHCP agent
    information.
    
    The call to "sync_state" passed the result of the dictionary
    ".keys()" method. In python2.7, when that was implemented, this
    method created a list with the dictionary keys. In python3, this
    method returns a view object that holds a reference to the
    dictionary, and therefore to its content.
    
    This patch breaks this reference chain in two points (actually only
    one is needed):
    - "sync_state" now passes a list created from the mentioned view.
    - The dictionary "needs_resync_reasons" now stores the exception
      strings only, instead of the exception instance.
    
    Closes-Bug: #1969270
    Change-Id: I07e9818021283d321fc32066be7e0f8e2b81e639


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969270

Title:
  neutron-dhcp-agent memory leak on network sync failure

Status in neutron:
  Fix Released

Bug description:
  neutron version: 15.0.2 (still present in the latest release)

  I've found a very interesting memory leak issue in neutron-dhcp-agent:

  When dhcp-agent tries to sync network state, it makes an RPC call to
  neutron-server. If something is wrong on neutron-server's side
  (a database access failure, for example), an error is returned to
  dhcp-agent and deserialized into a RemoteError object.

  The RemoteError is added to
  neutron.agent.dhcp.agent.DhcpAgent.needs_resync_reasons for periodic
  resync. The following code in the method
  neutron.agent.dhcp.agent.DhcpAgent._periodic_resync_helper() handles
  network resync:

              if self.needs_resync_reasons:
                  # be careful to avoid a race with additions to list
                  # from other threads
                  reasons = self.needs_resync_reasons
                  self.needs_resync_reasons = collections.defaultdict(list)
                  for net, r in reasons.items():
                      if not net:
                          net = "*"
                      LOG.debug("resync (%(network)s): %(reason)s",
                                {"reason": r, "network": net})
                  self.sync_state(reasons.keys())

  There's a trap here: in Python 3, "reasons.keys()" returns a view
  object that holds a reference to "reasons", so the self.sync_state
  method frame holds an indirect reference to the previous RemoteError
  object.
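
  A minimal sketch of this behaviour, assuming CPython 3. The names
  "Reason" and "reason_survives" are hypothetical and exist only for
  this illustration; they are not part of neutron. A weak reference is
  used to observe whether the stored object is still alive once the
  dictionary's own name is gone.

```python
import gc
import weakref


class Reason:
    """Hypothetical stand-in for an exception stored as a resync reason."""


def reason_survives(make_arg):
    """Return True if the stored Reason outlives the dictionary's own name."""
    reasons = {"net-1": [Reason()]}
    probe = weakref.ref(reasons["net-1"][0])
    arg = make_arg(reasons)  # what would be handed to sync_state()
    del reasons              # drop the only direct name for the dictionary
    gc.collect()
    # "arg" is still a live local here, like an argument in a stored frame.
    alive = probe() is not None
    return alive


# A keys view keeps the dictionary, and hence its values, reachable.
view_case = reason_survives(lambda d: d.keys())
# A list copy holds only the keys, so the dictionary can be freed.
list_case = reason_survives(lambda d: list(d.keys()))
```

  In this sketch "view_case" should be True and "list_case" False,
  which is the difference between the Python 3 and Python 2.7
  behaviour of ".keys()" that matters here.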

  When self.sync_state is invoked, another RemoteError is raised because
  neutron-server is still malfunctioning. The new RemoteError object's
  traceback has a reference to the sync_state frame, which still holds a
  reference to the previous RemoteError. So the earlier RemoteError
  objects are never garbage collected.

  I've generated a reference graph using objgraph, which helps to
  understand the reference chain. Please see the attachment.

  One proposed fix is to change self.sync_state(reasons.keys()) to
  self.sync_state(list(reasons.keys())) in
  DhcpAgent._periodic_resync_helper().

  Another way is to add str(reason) to self.needs_resync_reasons instead
  of the reason object itself, in DhcpAgent.schedule_resync().

  Either of them breaks the reference chain.
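
  The chain and the two proposed fixes can be reproduced outside
  neutron with a rough simulation, assuming CPython 3. Everything
  below is a hypothetical stand-in: "RemoteError" is a local class,
  "sync_state" always fails, and "run_resyncs" only mimics the
  swap-and-resync loop quoted above. It is not the agent's code.

```python
import collections
import gc
import weakref


class RemoteError(Exception):
    """Hypothetical stand-in for the error raised while the server fails."""


def sync_state(networks):
    # The traceback of the exception raised here keeps this frame alive,
    # and with it the "networks" argument.
    raise RemoteError("server unavailable")


def run_resyncs(rounds, make_arg, make_reason):
    """Simulate failing resyncs; return how many errors are still alive."""
    probes = []
    reasons = None
    needs_resync_reasons = collections.defaultdict(list)
    for _ in range(rounds):
        # Same swap as in the quoted _periodic_resync_helper() snippet.
        reasons = needs_resync_reasons
        needs_resync_reasons = collections.defaultdict(list)
        try:
            sync_state(make_arg(reasons))
        except RemoteError as exc:
            probes.append(weakref.ref(exc))
            needs_resync_reasons[None].append(make_reason(exc))
    reasons = None  # only the latest pending reasons remain referenced
    gc.collect()
    return sum(1 for probe in probes if probe() is not None)


# Original behaviour: keys view plus exception objects, every error survives.
leaky = run_resyncs(5, lambda d: d.keys(), lambda e: e)
# First fix: pass a list copy of the keys, only the latest error survives.
list_fix = run_resyncs(5, lambda d: list(d.keys()), lambda e: e)
# Second fix: store str(reason), no exception object is retained at all.
str_fix = run_resyncs(5, lambda d: d.keys(), str)
```

  Under these assumptions "leaky" should equal the number of rounds,
  "list_fix" should be 1 and "str_fix" should be 0, matching the
  statement that either change alone is enough.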

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1969270/+subscriptions


