yahoo-eng-team team mailing list archive

[Bug 1874733] [NEW] [OVN] Stale ports can be present in OVN NB leading to metadata errors

 

Public bug reported:

Right now, there's a chance that deleting a port in Neutron with ML2/OVN
deletes the object from the Neutron DB while leaving a stale port in the
OVN NB database.

This can happen when deleting a port [0] raises a RowNotFound exception.
Although that exception looks like it means the port was already gone
from OVN NB, the current port_delete function can raise it for other
reasons as well (especially against OVN < 2.10, where Address Sets were
used instead of Port Groups).

Such an exception is raised, for example, when some ACL or Address Set
doesn't exist [1][2], among other cases. When that happens, the revision
number of the object is still deleted [3] and the port stays stale
forever in OVN NB (the maintenance task skips it because it no longer
has a revision number).
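
To make the failure mode concrete, here is a minimal, self-contained
sketch. It is not the Neutron code: FakeNB, delete_port_transaction and
the "as-sg-1" Address Set name are hypothetical stand-ins, and the model
assumes the port deletion and the Address Set update share a single
transaction that fails as a whole.

```python
class RowNotFound(Exception):
    """Stand-in for the RowNotFound exception raised by the OVSDB layer."""


class FakeNB:
    """Toy in-memory model of the OVN NB database (hypothetical)."""

    def __init__(self):
        self.ports = {"port-1"}
        self.address_sets = set()  # the Address Set is already gone

    def delete_port_transaction(self, port_id):
        # Assumption: all commands run in one transaction, so a failure
        # in any of them means the port deletion is not committed either.
        if port_id not in self.ports:
            raise RowNotFound("Logical_Switch_Port %s" % port_id)
        if "as-sg-1" not in self.address_sets:
            raise RowNotFound("Address_Set as-sg-1")
        self.ports.discard(port_id)


def delete_port(nb, revisions, port_id):
    try:
        nb.delete_port_transaction(port_id)
    except RowNotFound:
        # Treated as "the port was already gone", which is not always true.
        pass
    # The revision number is dropped regardless of what actually happened.
    revisions.pop(port_id, None)


nb = FakeNB()
revisions = {"port-1": 3}
delete_port(nb, revisions, "port-1")

# Ports present in the NB model that no longer have a revision number.
stale = nb.ports - set(revisions)
```

After the call, "port-1" is still in the NB model but has no revision
number left, so nothing that relies on revision numbers will revisit it.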

One of the main impacts of this issue is that the OVN NB database grows
with stale objects that the maintenance task never detects (although
neutron-ovn-db-sync-util does detect them). More importantly, multiple
ports in the same OVN Logical Switch may end up with the same IP
address, which causes legitimate ports to be left without metadata.

As per the metadata agent code [4], if more than one port in the same
network has the same IP address, a 404 is returned to the instance
when it requests metadata.
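
The lookup behaviour described above can be sketched as follows. This
is a simplified illustration, not the agent code: PortBinding and
lookup_instance are hypothetical names, and the only rule modelled is
"exactly one matching port, otherwise 404".

```python
from collections import namedtuple

# Hypothetical, simplified view of a port binding.
PortBinding = namedtuple("PortBinding", ["logical_port", "network_id", "ip"])


def lookup_instance(port_bindings, network_id, remote_ip):
    """Return (status, logical_port) for a metadata request."""
    matches = [
        p for p in port_bindings
        if p.network_id == network_id and p.ip == remote_ip
    ]
    # The request can only be attributed to an instance when exactly one
    # port matches the network and the source IP address.
    if len(matches) != 1:
        return 404, None
    return 200, matches[0].logical_port


healthy = [PortBinding("vm-port", "net-1", "10.0.0.5")]
with_stale = healthy + [PortBinding("stale-port", "net-1", "10.0.0.5")]

ok = lookup_instance(healthy, "net-1", "10.0.0.5")
broken = lookup_instance(with_stale, "net-1", "10.0.0.5")
```

With a single port the request resolves to "vm-port"; once a stale port
with the same address exists, the legitimate instance gets a 404.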

The workaround is to run neutron-ovn-db-sync-util in repair mode to
get rid of the stale ports.

A proper fix would handle the exceptions that can happen around a port
deletion with finer granularity and act accordingly on each of them. At
the very least, we should not delete the revision number if the port
still exists, leaving it up to the maintenance task to fix it later on
(< 5 minutes). Ideally, we should identify all possible code paths and
delete the port from OVN whenever possible, even if some other
associated operation fails (with proper logging).
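
One possible shape for such a fix is sketched below. It is only an
illustration under assumptions: ToyNB, remove_acls, delete_lswitch_port
and delete_port_safely are hypothetical names, not the existing Neutron
API. Associated cleanup is best-effort and logged, and the revision
number is kept whenever the port is still present.

```python
import logging

LOG = logging.getLogger(__name__)


class RowNotFound(Exception):
    """Stand-in for the RowNotFound exception raised by the OVSDB layer."""


class ToyNB:
    """Toy in-memory model of the OVN NB database (hypothetical)."""

    def __init__(self, ports, fail_port_delete=False):
        self.ports = set(ports)
        self.fail_port_delete = fail_port_delete

    def remove_acls(self, port_id):
        # Simulate an associated object that is already missing.
        raise RowNotFound("ACL for %s" % port_id)

    def delete_lswitch_port(self, port_id):
        if self.fail_port_delete:
            raise RuntimeError("NB transaction failed")
        if port_id not in self.ports:
            raise RowNotFound("Logical_Switch_Port %s" % port_id)
        self.ports.remove(port_id)


def delete_port_safely(nb, revisions, port_id):
    """Delete the port; drop its revision only if the port is really gone."""
    try:
        nb.remove_acls(port_id)
    except RowNotFound as exc:
        # A missing associated object must not block the port deletion.
        LOG.warning("Cleanup for port %s skipped: %s", port_id, exc)

    try:
        nb.delete_lswitch_port(port_id)
    except RowNotFound:
        LOG.debug("Port %s was already gone from OVN NB", port_id)
    except Exception:
        LOG.exception("Deleting port %s from OVN NB failed", port_id)

    if port_id in nb.ports:
        # Keep the revision number so a later pass can retry the deletion.
        return False
    revisions.pop(port_id, None)
    return True


nb_ok = ToyNB({"p1"})
rev_ok = {"p1": 3}
deleted = delete_port_safely(nb_ok, rev_ok, "p1")

nb_bad = ToyNB({"p1"}, fail_port_delete=True)
rev_bad = {"p1": 3}
not_deleted = delete_port_safely(nb_bad, rev_bad, "p1")
```

In the first case the missing ACL is only logged and the port is still
removed; in the second the port survives, so its revision number is
preserved for a later retry.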


Also, this scenario seems to be more likely under high concurrency of
API operations (such as those issued by Heat) and possibly when Port
Groups are not supported by the schema (OVN < 2.10).

Daniel Alvarez


[0] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
[1] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
[2] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
[3] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
[4] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: ovn

** Tags added: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1874733

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1874733/+subscriptions