yahoo-eng-team team mailing list archive
Message #96080
[Bug 2095152] Re: ovs-agent: Leftover tpi/spi interfaces after VM boot/delete with trunk port(s)
Reviewed: https://review.opendev.org/c/openstack/neutron/+/949217
Committed: https://opendev.org/openstack/neutron/commit/e69505d293ef12999f948531acbde75e16a65cd4
Submitter: "Zuul (22348)"
Branch: master
commit e69505d293ef12999f948531acbde75e16a65cd4
Author: Bence Romsics <bence.romsics@xxxxxxxxx>
Date: Wed May 7 16:29:58 2025 +0200
Limit trunk ACTIVE state hack to OVN
In https://review.opendev.org/c/openstack/neutron/+/853779 we started
moving a trunk to ACTIVE as soon as its parent port went to ACTIVE. The
intention was to not leave the trunk in DOWN after a live migration, as
reported in #1988549. However, this had side effects. Previously we only
moved a trunk to ACTIVE once all of its ports had been processed, so we
unintentionally changed the meaning of the trunk ACTIVE status. This
affected all backends, and not just live migration but creation too.
This change moves the logic that propagates the parent port's ACTIVE
status to the trunk itself into the OVN trunk driver, so the undesired
effects are limited to ml2/ovn. This restores the original meaning of
trunk ACTIVE for all non-OVN backends. Ideally we would limit the
effect to live migration (so create is not affected), but I did not find
a way to do that.
Change-Id: I4d2c3db355e29fffcce0f50cd12bb1e31d1be43a
Closes-Bug: #2095152
Related-Bug: #1988549
Related-Change: https://review.opendev.org/c/openstack/os-vif/+/949736
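To see the behavioural difference described above from the outside, one can watch in which order the parent port and the trunk report ACTIVE while a server boots. The following is a minimal sketch, not part of the fix, assuming a devstack-style environment and reusing the tport0/trunk0/tvm0 names, flavor and image from the reproduction script quoted further down; on ml2/ovn the trunk is expected to follow the parent port to ACTIVE, while on non-OVN backends it should only go ACTIVE once the agent has finished wiring the trunk's ports.
# Sketch only: names (tport0, trunk0, tvm0), flavor and image are taken
# from the reproduction script below; adjust them to your environment.
openstack server create --flavor cirros256 --image cirros-0.6.3-x86_64-disk \
    --nic port-id=tport0 tvm0

# Poll both statuses to see in which order they become ACTIVE.
for n in $( seq 1 60 )
do
    port_status="$( openstack port show tport0 -f value -c status )"
    trunk_status="$( openstack network trunk show trunk0 -f value -c status )"
    echo "$( date +%T ) port=$port_status trunk=$trunk_status"
    [ "$port_status" = "ACTIVE" ] && [ "$trunk_status" = "ACTIVE" ] && break
    sleep 1
done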
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2095152
Title:
ovs-agent: Leftover tpi/spi interfaces after VM boot/delete with trunk
port(s)
Status in neutron:
Fix Released
Status in os-vif:
Fix Released
Status in os-vif 2024.1 series:
Triaged
Status in os-vif 2024.2 series:
Triaged
Status in os-vif 2025.1 series:
Triaged
Status in os-vif 2025.2 series:
Fix Released
Bug description:
We have seen tpi- and spi- interfaces in OVS that were not deleted by
ovs-agent when they should have been.
At the moment I only have a chance-based reproduction, with wildly
varying frequency of the error symptoms:
ovs-dump() {
    for bridge in $( sudo ovs-vsctl list-br )
    do
        for port in $( sudo ovs-vsctl list-ports $bridge )
        do
            echo $bridge $port
        done
    done | sort
}

ovs-dump > ovs-state.0

for j in $( seq 1 10 )
do
    openstack network create tnet0
    openstack subnet create --network tnet0 --subnet-range 10.0.100.0/24 tsubnet0
    openstack port create --network tnet0 tport0
    openstack network trunk create --parent-port tport0 trunk0
    tport0_mac="$( openstack port show tport0 -f value -c mac_address )"

    for i in $( seq 1 30 )
    do
        openstack network create tnet$i
        openstack subnet create --network tnet$i --subnet-range 10.0.$(( 100 + $i )).0/24 tsubnet$i
        openstack port create --network tnet$i --mac-address "$tport0_mac" tport$i
        openstack network trunk set --subport port=tport$i,segmentation-type=vlan,segmentation-id=$(( 100 + $i )) trunk0
    done

    openstack server create --flavor cirros256 --image cirros-0.6.3-x86_64-disk --nic port-id=tport0 tvm0 --wait

    # Theoretically not needed, but still make sure we don't interrupt any work in progress, to make the repro more uniform.
    while [ "$( openstack network trunk show trunk0 -f value -c status )" != "ACTIVE" ]
    do
        sleep 1
    done

    openstack server delete tvm0 --wait
    openstack network trunk delete trunk0
    openstack port list -f value -c ID -c Name | awk '/tport/ { print $1 }' | xargs -r openstack port delete
    openstack net list -f value -c ID -c Name | awk '/tnet/ { print $1 }' | xargs -r openstack net delete
done

sleep 10
ovs-dump > ovs-state.1
diff -u ovs-state.{0,1}
One example output with j=1..20 and i=1..30:
--- ovs-state.0 2025-01-16 13:31:07.881407421 +0000
+++ ovs-state.1 2025-01-16 14:52:45.323392243 +0000
@@ -8,9 +8,27 @@
br-int qr-88029aef-01
br-int sg-73e24638-69
br-int sg-e45cf925-de
+br-int spi-1eeb4ae6-1b
+br-int spi-2093a8c2-df
+br-int spi-2d9ae883-d9
+br-int spi-3f17d563-cd
+br-int spi-9c0d9c98-d8
+br-int spi-a2dc4baf-ef
+br-int spi-af2efafa-39
+br-int spi-c14e8bc3-62
+br-int spi-c16959f8-da
+br-int spi-e90d4d84-31
br-int tap03961474-06
br-int tap3e6a6311-95
br-int tpi-1f8b5666-bf
+br-int tpi-2477b06f-5d
+br-int tpi-4421d69a-be
+br-int tpi-572a3af8-42
br-int tpi-9cf24ba1-ba
+br-int tpi-9e60cb66-5e
+br-int tpi-a533a27b-78
+br-int tpi-cddcaa7b-15
+br-int tpi-d7cd2e3e-e6
+br-int tpi-e68ca29d-4d
br-physnet0 phy-br-physnet0
br-tun patch-int
These ports are not even cleaned up by an ovs-agent restart, and during
the runs I have not found ERROR messages in the ovs-agent logs.
The number of ports left behind varies wildly. I have seen cases where
more than 50% of the VM start/deletes left behind a tpi port, but I have
also seen cases where I needed ten runs (j=1..10) to see the first
leftover interface. This makes me believe there is a causal factor here
(probably timing based) that I do not yet understand and cannot control.
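A quick way to check which of these interfaces are actually stale (and, if needed, to remove them by hand) is sketched below. It only uses ovs-vsctl and the openstack CLI, and it assumes, as the names above suggest, that the tpi-/spi- interface names embed a prefix of the corresponding Neutron port UUID; treat it as a diagnostic aid and verify an interface really is orphaned before deleting it.
# List candidate leftover trunk parent/subport interfaces on br-int.
sudo ovs-vsctl list-ports br-int | grep -E '^(tpi|spi)-' > leftover-candidates

# Ports Neutron still knows about in the current project.
openstack port list -f value -c ID > neutron-ports

while read iface
do
    prefix="${iface#???-}"    # strip the tpi-/spi- prefix, keep the UUID prefix
    if ! grep -q "^$prefix" neutron-ports
    then
        echo "stale: $iface"
        # Manual cleanup, only after double-checking the interface is unused:
        # sudo ovs-vsctl del-port br-int "$iface"
    fi
done < leftover-candidates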
I want to get back to analysing the root cause, but I hope that first I
can find a quicker and more reliable reproduction method, so it becomes
easier to work on this.
devstack 2f3440dc
neutron 8cca47f2e7
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2095152/+subscriptions