yahoo-eng-team team mailing list archive

[Bug 2037102] Re: neutron-ovn-metadata-agent dies on broken namespace

 

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/896251
Committed: https://opendev.org/openstack/neutron/commit/566fea3fed837b0130023303c770aade391d3d61
Submitter: "Zuul (22348)"
Branch:    master

commit 566fea3fed837b0130023303c770aade391d3d61
Author: Felix Huettner <felix.huettner@mail.schwarz>
Date:   Fri Sep 22 16:25:10 2023 +0200

    fix netns deletion of broken namespaces
    
    Normal network namespaces are bind-mounted to files under
    /var/run/netns. If a process deleting a network namespace gets killed
    during that operation, there is a chance that the bind mount to the
    netns has been removed, but the file under /var/run/netns still exists.
    
    When the neutron-ovn-metadata-agent tries to clean up such network
    namespaces, it first tries to validate that the network namespace is
    empty. For the cases described above this fails, as this network
    namespace no longer really exists, but is just a stray file lying
    around.
    
    To fix this we treat network namespaces where we get an `OSError` with
    errno 22 (Invalid Argument) as empty. The calls to pyroute2 to delete
    the namespace will then clean up the file.
    
    Additionally, we add a guard to teardown_datapath to continue even if
    this fails. Failing to remove a datapath is not critical: in the worst
    case it leaves a process and a network namespace running. Previously,
    however, it would also have prevented the creation of new datapaths,
    which is critical for VM startup.
    
    Closes-Bug: #2037102
    Change-Id: I7c43812fed5903f98a2e491076c24a8d926a59b4
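
The two changes described in the commit message can be illustrated with a
minimal sketch. This is not the code from the neutron patch: the function
names `namespace_is_empty` and `teardown_all`, and the callables passed into
them, are hypothetical stand-ins chosen for illustration. The sketch only
assumes what the commit message states, namely that an `OSError` with errno 22
means the namespace file is stray, and that a failed teardown should not block
further work.

```python
import errno
import logging

LOG = logging.getLogger(__name__)


def namespace_is_empty(get_devices):
    """Return True if the namespace has no devices or is only a stray file.

    `get_devices` is a hypothetical callable standing in for the device
    lookup that has to enter the namespace.
    """
    try:
        return not get_devices()
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            # errno 22: the file under /var/run/netns is no longer backed
            # by a namespace, so there is nothing inside it.
            return True
        # Any other error is unexpected and is passed on to the caller.
        raise


def teardown_all(datapaths, teardown_datapath):
    """Tear down each datapath, continuing past individual failures.

    Returns the list of datapaths whose teardown raised an exception.
    """
    failed = []
    for datapath in datapaths:
        try:
            teardown_datapath(datapath)
        except Exception:
            # A leftover datapath is not critical; log it and carry on so
            # that new datapaths can still be created afterwards.
            LOG.exception("Error tearing down datapath %s", datapath)
            failed.append(datapath)
    return failed
```

Only errno 22 is treated as "empty" here; other errors still propagate, so
unrelated problems are not silently hidden.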


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2037102

Title:
  neutron-ovn-metadata-agent dies on broken namespace

Status in neutron:
  Fix Released

Bug description:
  neutron-ovn-metadata-agent uses network namespaces to separate the
  metadata services for individual networks. For each network it
  automatically creates or destroys an appropriate namespace.

  If the metadata agent dies for reasons outside of its control (e.g. a
  SIGKILL) during the process of namespace destruction a broken
  namespace can be left over.

  ---
  Background on pyroute2 namespace management:

  Creating a network namespace works by:
  1. Forking the process and doing everything in the new child
  2. Ensuring /var/run/netns exists
  3. Ensuring the file for the network namespace under /var/run/netns exists by creating a new empty file
  4. Calling `unshare` with `CLONE_NEWNET` to move the process to a new network namespace
  5. Creating a bind mount from `/proc/self/ns/net` to the file under /var/run/netns

  Deleting a network namespace works the other way around (but shorter):
  1. Unmounting the previously created bind mount
  2. Deleting the file for the network namespace
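
  The creation and deletion steps above can be sketched directly with
  ctypes. This is a simplified illustration, not pyroute2's actual code:
  the function names, the `run_dir` and `libc` parameters, and the error
  handling are assumptions made for this example. Running it against the
  real /var/run/netns requires root on Linux. The return value of
  `umount2` is deliberately ignored here, on the assumption (consistent
  with the commit message) that deletion should still remove a stray file
  that is no longer a mount point.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # from <sched.h>
MS_BIND = 4096             # from <sys/mount.h>
NETNS_RUN_DIR = "/var/run/netns"


def _check(ret, what):
    """Raise OSError with the C errno if a libc call returned non-zero."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def create_netns(name, run_dir=NETNS_RUN_DIR, libc=None):
    """Create a named network namespace (steps 1 to 5 of creation)."""
    libc = libc or ctypes.CDLL(None, use_errno=True)
    path = os.path.join(run_dir, name)
    pid = os.fork()                                      # step 1
    if pid:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            raise RuntimeError(f"creating netns {name} failed")
        return
    try:
        os.makedirs(run_dir, exist_ok=True)              # step 2
        flags = os.O_RDONLY | os.O_CREAT | os.O_EXCL
        os.close(os.open(path, flags))                   # step 3
        _check(libc.unshare(CLONE_NEWNET), "unshare")    # step 4
        _check(libc.mount(b"/proc/self/ns/net", path.encode(),
                          b"none", MS_BIND, None), "mount")  # step 5
        os._exit(0)
    except BaseException:
        os._exit(1)


def delete_netns(name, run_dir=NETNS_RUN_DIR, libc=None):
    """Delete a named network namespace (steps 1 and 2 of deletion)."""
    libc = libc or ctypes.CDLL(None, use_errno=True)
    path = os.path.join(run_dir, name)
    libc.umount2(path.encode(), 0)                       # step 1
    # A SIGKILL arriving here leaves a plain file behind that no longer
    # refers to any namespace.
    os.unlink(path)                                      # step 2
```

  The comment between the two deletion steps marks the window in which a
  killed process leaves the namespace file behind.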

  ---

  If the neutron-ovn-metadata-agent is killed between steps 1 and 2 of
  deleting the network namespace, then the namespace file will still be
  around, but will not point to any namespace.

  When `garbage_collect_namespace` checks whether the namespace is empty, it tries to enter the network namespace to dump all devices in there. This raises an exception, as the namespace can no longer be entered.
  neutron-ovn-metadata-agent then crashes, tries again on the next start, and crashes again.

  
  ```
  Traceback (most recent call last):
     File "/usr/local/bin/neutron-ovn-metadata-agent", line 8, in <module>
       sys.exit(main())
     File "/usr/local/lib/python3.9/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
       metadata_agent.main()
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata_agent.py", line 41, in main
       agt.start()
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 277, in start
       self.sync()
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 61, in wrapped
       return f(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 349, in sync
       self.teardown_datapath(self._get_datapath_name(ns))
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py", line 400, in teardown_datapath
       ip.garbage_collect_namespace()
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 268, in garbage_collect_namespace
       if self.namespace_is_empty():
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 263, in namespace_is_empty
       return not self.get_devices()
     File "/usr/local/lib/python3.9/site-packages/neutron/agent/linux/ip_lib.py", line 180, in get_devices
       devices = privileged.get_device_names(self.namespace)
     File "/usr/local/lib/python3.9/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 609, in get_device_names
       in get_link_devices(namespace, **kwargs)]
     File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
       return self(f, *args, **kw)
     File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
       do = self.iter(retry_state=retry_state)
     File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
       return fut.result()
     File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
       return self.__get_result()
     File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
       raise self._exception
     File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
       result = fn(*args, **kwargs)
     File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
       return self.channel.remote_call(name, args, kwargs,
     File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
       raise exc_type(*result[2])
  OSError: [Errno 22] failed to open netns
  ```

  
  Versions: as far as I know, all versions are affected

  Reproduction: the easiest way is to create an empty file named
  `/var/run/netns/ovnmeta-<some-uuid>` and restart the
  neutron-ovn-metadata-agent. Alternatively, use a breakpoint or a
  well-timed kill command.
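
  Creating the stray file can be scripted. The helper below is a
  hypothetical sketch (the name `make_stray_netns_file` and its `run_dir`
  parameter are not part of neutron or pyroute2); it only creates the
  empty file. Restarting the agent is not shown, since how that is done
  depends on the deployment.

```python
import uuid
from pathlib import Path


def make_stray_netns_file(run_dir="/var/run/netns"):
    """Create an empty ovnmeta-<uuid> file that is not backed by a namespace."""
    path = Path(run_dir) / f"ovnmeta-{uuid.uuid4()}"
    path.parent.mkdir(parents=True, exist_ok=True)
    # exist_ok=False: refuse to touch a file that is already there.
    path.touch(exist_ok=False)
    return path
```

  Writing to the real /var/run/netns requires root; for a dry run, pass
  a temporary directory as `run_dir`.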

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2037102/+subscriptions


