yahoo-eng-team team mailing list archive
Message #82413
[Bug 1874733] [NEW] [OVN] Stale ports can be present in OVN NB leading to metadata errors
Public bug reported:
Right now, there's a chance that deleting a port in Neutron with ML2/OVN
actually deletes the object from Neutron DB while leaving a stale port
in the OVN NB database.
This can happen when deleting a port [0] raises a RowNotFound exception.
While that may look like it means the port no longer existed in OVN NB,
the truth is that the current port_delete function can throw that
exception for different reasons (especially against OVN < 2.10, when
Address Sets were used instead of Port Groups).
Such an exception can be observed, for example, if some ACL or Address
Set doesn't exist [1][2], amongst others. In this case, the revision number
of the object will be deleted [3] and the port will be stale forever in
OVN NB (it'll be skipped by the maintenance task).
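The failure mode described above can be illustrated with a toy model.
This is only a sketch, not the Neutron code: FakeNorthbound,
delete_port_naive and the revisions dict are hypothetical stand-ins,
and the real code runs these operations inside an OVSDB transaction
rather than one by one.

```python
class RowNotFound(Exception):
    """Stand-in for the OVSDB "row not found" error."""


class FakeNorthbound:
    """Toy in-memory model of the OVN NB database (hypothetical)."""

    def __init__(self, ports, acls):
        self.ports = set(ports)
        self.acls = set(acls)

    def delete_acl(self, name):
        if name not in self.acls:
            raise RowNotFound(name)
        self.acls.remove(name)

    def delete_port(self, name):
        if name not in self.ports:
            raise RowNotFound(name)
        self.ports.remove(name)


def delete_port_naive(nb, revisions, port, acl):
    """Treat every RowNotFound as "the port was already gone"."""
    try:
        nb.delete_acl(acl)    # raises here: the ACL does not exist
        nb.delete_port(port)  # never reached
    except RowNotFound:
        pass  # wrong assumption: the missing row was the ACL, not the port
    # The revision number is dropped even though the port is still there,
    # so nothing that relies on revision numbers will look at it again.
    revisions.pop(port, None)


nb = FakeNorthbound(ports={"port-1"}, acls=set())
revisions = {"port-1": 3}
delete_port_naive(nb, revisions, "port-1", "acl-1")
print("port-1" in nb.ports, "port-1" in revisions)  # prints: True False
```

The port survives in the toy NB database while its revision entry is
gone, which is the stale state described above.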
One of the main impacts of this issue is that the OVN NB database will
grow and accumulate stale objects that go undetected by the maintenance
task (they'll be detected by the neutron-ovn-db-sync-script). Most
importantly, multiple ports in the same OVN Logical Switch may end up
with the same IP address, which causes legitimate ports to be left
without metadata.
As per the metadata agent code here [4], if more than one port in the
same network has the same IP address, a 404 will be returned to the
instance upon requesting metadata.
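A rough sketch of why a duplicate IP breaks the lookup follows. It
assumes the agent needs exactly one matching port to identify the
instance; metadata_status and the port dictionaries are hypothetical
and do not mirror the actual code in [4].

```python
def metadata_status(ports, network_id, ip_address):
    """Return the HTTP status a metadata request would get (sketch)."""
    matches = [
        p for p in ports
        if p["network"] == network_id and ip_address in p["ips"]
    ]
    # Assumption: anything other than exactly one match is ambiguous,
    # so the request cannot be tied to an instance.
    if len(matches) != 1:
        return 404
    return 200


ports = [{"name": "vm-port", "network": "net-a", "ips": ["10.0.0.5"]}]
healthy = metadata_status(ports, "net-a", "10.0.0.5")

# A stale port left behind by an earlier delete reuses the same IP.
ports.append({"name": "stale-port", "network": "net-a", "ips": ["10.0.0.5"]})
with_stale = metadata_status(ports, "net-a", "10.0.0.5")

print(healthy, with_stale)  # prints: 200 404
```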
The workaround is running the neutron-ovn-db-sync-script in repair mode
to get rid of the stale ports.
A proper fix would involve handling the exceptions that can happen
around a port deletion with finer granularity and acting accordingly on
each of them. In the worst case, we won't delete the revision number if
the port still exists, leaving it up to the maintenance task to fix it
later on (< 5 minutes). Ideally, we should identify all possible code
paths and delete the port from OVN whenever possible, even if some other
associated operation fails (with proper logging).
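One possible shape for such a fix is sketched below, under the same toy
assumptions: FakeNorthbound, BrokenNorthbound, delete_port_granular and
the revisions dict are hypothetical, and the real change would have to
deal with OVSDB transactions. The point is only that each step gets its
own handling and the revision number is dropped only once the port is
really gone.

```python
import logging

LOG = logging.getLogger(__name__)


class RowNotFound(Exception):
    """Stand-in for the OVSDB "row not found" error."""


class FakeNorthbound:
    """Toy in-memory model of the OVN NB database (hypothetical)."""

    def __init__(self, ports, acls):
        self.ports = set(ports)
        self.acls = set(acls)

    def delete_acl(self, name):
        if name not in self.acls:
            raise RowNotFound(name)
        self.acls.remove(name)

    def delete_port(self, name):
        if name not in self.ports:
            raise RowNotFound(name)
        self.ports.remove(name)


class BrokenNorthbound(FakeNorthbound):
    """Variant whose port deletion always fails, to model the worst case."""

    def delete_port(self, name):
        raise RuntimeError("simulated failure deleting %s" % name)


def delete_port_granular(nb, revisions, port, acl):
    # A missing ACL is logged but must not prevent the port deletion.
    try:
        nb.delete_acl(acl)
    except RowNotFound:
        LOG.warning("ACL %s not found while deleting port %s", acl, port)
    try:
        nb.delete_port(port)
    except RowNotFound:
        LOG.debug("Port %s already absent from OVN NB", port)
    except Exception:
        LOG.exception("Failed to delete port %s from OVN NB", port)
    # Worst case: the port is still there, so keep its revision number
    # and let a later periodic pass retry the cleanup.
    if port not in nb.ports:
        revisions.pop(port, None)


nb = FakeNorthbound(ports={"port-1"}, acls=set())
revisions = {"port-1": 3}
delete_port_granular(nb, revisions, "port-1", "acl-1")

broken = BrokenNorthbound(ports={"port-2"}, acls={"acl-2"})
broken_revisions = {"port-2": 7}
delete_port_granular(broken, broken_revisions, "port-2", "acl-2")
```

In the first call the missing ACL no longer blocks the port deletion; in
the second the port cannot be removed, so its revision entry is kept.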
Also, this scenario seems to be more likely under high concurrency of
API operations (such as those issued by Heat) and possibly when Port
Groups are not supported by the schema (OVN < 2.10).
Daniel Alvarez
[0] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
[1] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
[2] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
[3] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
[4] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86
** Affects: neutron
Importance: Undecided
Status: New
** Tags: ovn
** Tags added: ovn
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1874733
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1874733/+subscriptions