yahoo-eng-team team mailing list archive

[Bug 1346861] [NEW] l3 cannot re-create device in deleted namespace

 

Public bug reported:

If the namespace of an ovs-managed device (a device created by add-port
followed by set type=internal) is still in use by some process when it
is deleted, the L3 agent fails to re-create the device.

Steps to repro:

- Stop l3-agent.
- Choose a router namespace with at least one ovs-managed device in it. For example, "qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184" has a device "qg-df5a3693-ec".
- Ensure the namespace is used by at least one process. For demo purposes, start another shell using "ip netns exec qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184 bash". In reality, ns-metadata-proxy or keepalived may live in the namespace.
- Delete the namespace with "ip netns del qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184". The command won't fail, and the devices in the deleted namespace are still alive, as "ip link" in the previously opened shell shows. However, there is no easy way to enter the namespace from outside again.
- Start l3-agent.
- Verify that "qg-df5a3693-ec" cannot be re-created and managed by the L3 agent. The backtrace looks like this (this is our branch and may differ from upstream):

  ERROR neutron.agent.l3_agent Failed synchronizing routers
  TRACE neutron.agent.l3_agent Traceback (most recent call last):
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 1429, in _sync_routers_task
  TRACE neutron.agent.l3_agent     self._process_routers(routers, all_routers=True)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 1354, in _process_routers
  TRACE neutron.agent.l3_agent     self._router_added(r['id'], r)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 672, in _router_added
  TRACE neutron.agent.l3_agent     self.process_ha_router_added(ri)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 923, in process_ha_router_added
  TRACE neutron.agent.l3_agent     vip_cidrs=[gw_ip_cidr])
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 897, in ha_network_added
  TRACE neutron.agent.l3_agent     prefix=HA_DEV_PREFIX)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 194, in plug
  TRACE neutron.agent.l3_agent     ns_dev.link.set_address(mac_address)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 230, in set_address
  TRACE neutron.agent.l3_agent     self._as_root('set', self.name, 'address', mac_address)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 217, in _as_root
  TRACE neutron.agent.l3_agent     kwargs.get('use_root_namespace', False))
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 70, in _as_root
  TRACE neutron.agent.l3_agent     namespace)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 81, in _execute
  TRACE neutron.agent.l3_agent     root_helper=root_helper)
  TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 90, in execute
  TRACE neutron.agent.l3_agent     raise RuntimeError(m)
  TRACE neutron.agent.l3_agent RuntimeError: 
  TRACE neutron.agent.l3_agent Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'link', 'set', 'ha-5bd08318-aa', 'address', 'fa:16:3e:f3:2b:6b']
  TRACE neutron.agent.l3_agent Exit code: 1
  TRACE neutron.agent.l3_agent Stdout: ''
  TRACE neutron.agent.l3_agent Stderr: 'Cannot find device "ha-5bd08318-aa"\n'
  TRACE neutron.agent.l3_agent 
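
The namespace part of the repro steps above can be sketched in Python as
follows. This is only an illustration: build_repro_commands is a
hypothetical helper, and "sleep 600" is an assumed stand-in for the
interactive bash shell that keeps the namespace in use. Stopping and
starting l3-agent is deployment-specific and left out.

```python
def build_repro_commands(namespace, holder=('sleep', '600')):
    """Return the two namespace-related repro commands as argument lists.

    Hypothetical helper for illustration only. The caller has to start
    hold_cmd first (in the background, as root) and run delete_cmd while
    that process is still alive.
    """
    # Keep one process alive inside the namespace; this stands in for the
    # extra bash shell (or ns-metadata-proxy / keepalived in reality).
    hold_cmd = ['ip', 'netns', 'exec', namespace] + list(holder)
    # Delete the namespace name while it is still in use.
    delete_cmd = ['ip', 'netns', 'del', namespace]
    return hold_cmd, delete_cmd
```

The helper only builds the argument lists, so it can be inspected or
logged before anything is actually executed.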

The root cause is that ovs-vsctl "can perform any number of commands in
a single run, implemented as a single atomic transaction against the
database", and neutron currently uses the following command to create an
ovs-managed device:

  ovs-vsctl -- --if-exists del-port qr-2f4c613d-b7 -- add-port br-int qr-2f4c613d-b7 -- set Interface qr-2f4c613d-b7 type=internal -- set Interface qr-2f4c613d-b7 external-ids:iface-id=2f4c613d-b7f2-4d63-89c8-af2d48948d19 -- set Interface qr-2f4c613d-b7 external-ids:iface-status=active -- set Interface qr-2f4c613d-b7 external-ids:attached-mac=fa:16:3e:3c:4d:18

ovs can delete a device it manages even if the device is in a deleted
(lost) namespace. But if del-port, add-port and set type=internal are put
together in one ovs-vsctl command, ovs does nothing to the device and
leaves it as is.


In OVSInterfaceDriver.plug(self, network_id, port_id, device_name, mac_address, bridge=None, namespace=None, prefix=None):

    self._ovs_add_port(bridge, tap_name, port_id, mac_address,
                       internal=internal)

    ns_dev.link.set_address(mac_address)

    if self.conf.network_device_mtu:
        ns_dev.link.set_mtu(self.conf.network_device_mtu)
        if self.conf.ovs_use_veth:
            root_dev.link.set_mtu(self.conf.network_device_mtu)

    # Add an interface created by ovs to the namespace.
    if not self.conf.ovs_use_veth and namespace:
        namespace_obj = ip.ensure_namespace(namespace)
        namespace_obj.add_device_to_namespace(ns_dev)

As you can see, setting the MAC address, setting the MTU and moving the
device into the namespace all use the `ip` command directly, which
requires `ip` to have access to the device. The device created or
re-created by ovs (in self._ovs_add_port) therefore must not belong to
any namespace. This can be guaranteed by splitting the giant ovs-vsctl
command above into two parts:

  ovs-vsctl --if-exists del-port qr-2f4c613d-b7
  ovs-vsctl -- add-port br-int qr-2f4c613d-b7 -- set Interface qr-2f4c613d-b7 type=internal -- set Interface qr-2f4c613d-b7 external-ids:iface-id=2f4c613d-b7f2-4d63-89c8-af2d48948d19 -- set Interface qr-2f4c613d-b7 external-ids:iface-status=active -- set Interface qr-2f4c613d-b7 external-ids:attached-mac=fa:16:3e:3c:4d:18
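
A minimal Python sketch of this split, assuming a hypothetical helper
named build_ovs_plug_commands (it is not Neutron's actual _ovs_add_port)
that only builds the two argument lists. Running them, for example
through the agent's root helper, is left to the caller:

```python
def build_ovs_plug_commands(bridge, device_name, port_id, mac_address,
                            internal=True):
    """Return two separate ovs-vsctl argument lists: delete, then add.

    Hypothetical helper for illustration; the names are not Neutron's API.
    """
    # First transaction: remove any stale port on its own, so that ovs
    # really deletes the old device, even one in a deleted namespace.
    delete_cmd = ['ovs-vsctl', '--if-exists', 'del-port', device_name]

    # Second transaction: add the port again and set its attributes.
    add_cmd = ['ovs-vsctl', '--', 'add-port', bridge, device_name]
    if internal:
        add_cmd += ['--', 'set', 'Interface', device_name, 'type=internal']
    add_cmd += [
        '--', 'set', 'Interface', device_name,
        'external-ids:iface-id=%s' % port_id,
        '--', 'set', 'Interface', device_name,
        'external-ids:iface-status=active',
        '--', 'set', 'Interface', device_name,
        'external-ids:attached-mac=%s' % mac_address,
    ]
    return delete_cmd, add_cmd
```

Called with 'br-int', 'qr-2f4c613d-b7', the port ID and the MAC address
from the example, the helper returns the same two commands as shown
above. The caller must run the first command to completion before the
second one.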

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1346861

Title:
  l3 cannot re-create device in deleted namespace

Status in OpenStack Neutron (virtual network service):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1346861/+subscriptions

