[Bug 2095152] Re: ovs-agent: Leftover tpi/spi interfaces after VM boot/delete with trunk port(s)

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/949217
Committed: https://opendev.org/openstack/neutron/commit/e69505d293ef12999f948531acbde75e16a65cd4
Submitter: "Zuul (22348)"
Branch:    master

commit e69505d293ef12999f948531acbde75e16a65cd4
Author: Bence Romsics <bence.romsics@xxxxxxxxx>
Date:   Wed May 7 16:29:58 2025 +0200

    Limit trunk ACTIVE state hack to OVN
    
    In https://review.opendev.org/c/openstack/neutron/+/853779 we started
    moving a trunk to ACTIVE when its parent port went to ACTIVE. The
    intention was to not leave the trunk in DOWN after a live migration,
    as reported in #1988549. However, this had side effects: earlier we
    moved a trunk to ACTIVE only when all of its ports had been processed,
    so we unintentionally changed the meaning of the trunk ACTIVE status.
    This affected all backends, and not just live migration but port
    creation too.

    This change moves the logic of propagating the parent port's ACTIVE
    status to the trunk itself into the OVN trunk driver, so the undesired
    effects are limited to ml2/ovn. This restores the original meaning of
    trunk ACTIVE for all non-OVN backends. Ideally we would limit the
    effect to live migration only (so create is not affected), but I did
    not find a way to do that.
    
    Change-Id: I4d2c3db355e29fffcce0f50cd12bb1e31d1be43a
    Closes-Bug: #2095152
    Related-Bug: #1988549
    Related-Change: https://review.opendev.org/c/openstack/os-vif/+/949736
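
As a purely illustrative sketch of the semantics the commit describes
(hypothetical function and value names, not the actual Neutron code):
only the OVN trunk driver now promotes a trunk to ACTIVE when its parent
port goes ACTIVE, while every other backend keeps the original rule of
going ACTIVE only once all of the trunk's ports have been processed.

    ACTIVE = 'ACTIVE'
    DOWN = 'DOWN'

    def trunk_status_after_parent_update(backend, parent_status,
                                         all_subports_processed,
                                         current_status=DOWN):
        """Status a trunk should report after its parent port changes."""
        if backend == 'ovn' and parent_status == ACTIVE:
            # OVN-only behaviour: follow the parent port so a live-migrated
            # trunk is not left DOWN (bug #1988549).
            return ACTIVE
        if all_subports_processed:
            # Original semantics, restored for all non-OVN backends.
            return ACTIVE
        return current_status

    # ml2/ovs: an ACTIVE parent alone no longer flips the trunk to ACTIVE.
    assert trunk_status_after_parent_update('ovs', ACTIVE, False) == DOWN
    # ml2/ovn: an ACTIVE parent is enough.
    assert trunk_status_after_parent_update('ovn', ACTIVE, False) == ACTIVE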


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2095152

Title:
  ovs-agent: Leftover tpi/spi interfaces after VM boot/delete with trunk
  port(s)

Status in neutron:
  Fix Released
Status in os-vif:
  Fix Released
Status in os-vif 2024.1 series:
  Triaged
Status in os-vif 2024.2 series:
  Triaged
Status in os-vif 2025.1 series:
  Triaged
Status in os-vif 2025.2 series:
  Fix Released

Bug description:
  We have seen tpi- and spi- interfaces left behind in OVS that ovs-agent
  should have deleted already.
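
  As a quick spot check (a sketch only; it assumes the integration
  bridge is named br-int and that ovs-vsctl can be run via sudo), any
  tpi-/spi- port still attached to br-int after every trunk and VM has
  been deleted is such a leftover:

  import subprocess

  # Ports currently attached to the integration bridge (assumed br-int).
  out = subprocess.run(
      ['sudo', 'ovs-vsctl', 'list-ports', 'br-int'],
      capture_output=True, text=True, check=True).stdout
  # Keep only the trunk-related patch ports.
  leftovers = [p for p in out.splitlines() if p.startswith(('tpi-', 'spi-'))]
  print('\n'.join(leftovers) if leftovers else 'no tpi-/spi- ports left')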

  At the moment I only have a chance-based reproduction, and the
  frequency of the error symptoms varies wildly:

  ovs-dump() {
      for bridge in $( sudo ovs-vsctl list-br )
      do
          for port in $( sudo ovs-vsctl list-ports "$bridge" )
          do
              echo "$bridge" "$port"
          done
      done | sort
  }

  ovs-dump > ovs-state.0

  for j in $( seq 1 10 )
  do
      openstack network create tnet0
      openstack subnet create --network tnet0 --subnet-range 10.0.100.0/24 tsubnet0
      openstack port create --network tnet0 tport0
      openstack network trunk create --parent-port tport0 trunk0
      tport0_mac="$( openstack port show tport0 -f value -c mac_address )"

      for i in $( seq 1 30 )
      do
          openstack network create tnet$i
          openstack subnet create --network tnet$i --subnet-range 10.0.$(( 100 + $i )).0/24 tsubnet$i
          openstack port create --network tnet$i --mac-address "$tport0_mac" tport$i
          openstack network trunk set --subport port=tport$i,segmentation-type=vlan,segmentation-id=$(( 100 + $i )) trunk0
      done

      openstack server create --flavor cirros256 --image cirros-0.6.3-x86_64-disk --nic port-id=tport0 tvm0 --wait
      # Theoretically not needed, but make sure we don't interrupt any
      # work in progress, to keep the repro more uniform.
      while [ "$( openstack network trunk show trunk0 -f value -c status )" != "ACTIVE" ]
      do
          sleep 1
      done

      openstack server delete tvm0 --wait
      openstack network trunk delete trunk0
      openstack port list -f value -c ID -c Name | awk '/tport/ { print $1 }' | xargs -r openstack port delete
      openstack network list -f value -c ID -c Name | awk '/tnet/ { print $1 }' | xargs -r openstack network delete
  done

  sleep 10
  ovs-dump > ovs-state.1

  diff -u ovs-state.{0,1}

  One example output with j=1..20 and i=1..30:

  --- ovs-state.0 2025-01-16 13:31:07.881407421 +0000
  +++ ovs-state.1 2025-01-16 14:52:45.323392243 +0000
  @@ -8,9 +8,27 @@
   br-int qr-88029aef-01
   br-int sg-73e24638-69
   br-int sg-e45cf925-de
  +br-int spi-1eeb4ae6-1b
  +br-int spi-2093a8c2-df
  +br-int spi-2d9ae883-d9
  +br-int spi-3f17d563-cd
  +br-int spi-9c0d9c98-d8
  +br-int spi-a2dc4baf-ef
  +br-int spi-af2efafa-39
  +br-int spi-c14e8bc3-62
  +br-int spi-c16959f8-da
  +br-int spi-e90d4d84-31
   br-int tap03961474-06
   br-int tap3e6a6311-95
   br-int tpi-1f8b5666-bf
  +br-int tpi-2477b06f-5d
  +br-int tpi-4421d69a-be
  +br-int tpi-572a3af8-42
   br-int tpi-9cf24ba1-ba
  +br-int tpi-9e60cb66-5e
  +br-int tpi-a533a27b-78
  +br-int tpi-cddcaa7b-15
  +br-int tpi-d7cd2e3e-e6
  +br-int tpi-e68ca29d-4d
   br-physnet0 phy-br-physnet0
   br-tun patch-int

  These ports are not cleaned up even by an ovs-agent restart. During
  the runs I did not find any ERROR messages in the ovs-agent logs.
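
  In a lab/devstack environment such leftovers can be removed by hand
  with ovs-vsctl; the following is a sketch only (again assuming the
  integration bridge is br-int and that no trunks are supposed to exist
  any more), since ovs-agent normally owns these ports:

  import subprocess

  # After all trunks and VMs are gone, drop any remaining tpi-/spi- patch
  # ports from br-int.  --if-exists keeps the command idempotent.
  out = subprocess.run(
      ['sudo', 'ovs-vsctl', 'list-ports', 'br-int'],
      capture_output=True, text=True, check=True).stdout
  for port in out.splitlines():
      if port.startswith(('tpi-', 'spi-')):
          subprocess.run(
              ['sudo', 'ovs-vsctl', '--if-exists', 'del-port', 'br-int',
               port],
              check=True)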

  The number of ports left behind varies wildly. I have seen cases where
  more than 50% of the VM start/delete cycles left behind one tpi port,
  but I have also seen cases where it took ten runs (j=1..10) to see the
  first leftover interface. This makes me believe there is a causal
  factor here (probably timing-based) that I do not understand and
  cannot control yet.

  I want to get back to analysing the root cause, but first I hope to
  find a quicker and more reliable reproduction method so that it
  becomes easier to work on this.

  devstack 2f3440dc
  neutron 8cca47f2e7

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2095152/+subscriptions


