yahoo-eng-team team mailing list archive

[Bug 2100752] [NEW] Severe Packet Loss During Live Migration Due to Premature Route Advertisement in Neutron BGP Dynamic Routing

 

Public bug reported:

- Summary

Neutron's BGP Dynamic Routing module advertises the new route for a
migrating instance before the instance is operational on the destination
host, leading to severe packet loss. Traffic is forwarded to a next-hop
that is not yet ready, causing connectivity disruptions.

- Steps to Reproduce

* Deploy an OpenStack Zed cluster with Neutron BGP Dynamic Routing enabled.
* Create an instance with a large amount of memory (for example, 128 GB) and ensure its address is advertised via BGP.
* Initiate a live migration of the instance to another compute node.
* Monitor BGP route advertisements during the migration process.
* Observe that a new route (with a different next-hop) is advertised before the instance is ready.
* Observe packet loss as traffic is sent to the new next-hop before the instance is ready there.

- Expected Behavior
* The new BGP route should be advertised only after the instance is fully operational.
* The old route should remain active until the new one is confirmed as reachable.
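
The expected behavior amounts to a make-before-break rule. The following is a minimal sketch of that rule only, not code from neutron-dynamic-routing; the function name and the is_ready callback are hypothetical, and is_ready stands in for whatever readiness signal is available (for example, the migration being reported as complete).

```python
def select_advertised_next_hop(current, candidate, is_ready):
    """Return the next-hop that should be advertised right now.

    current   -- next-hop advertised at the moment (old compute node)
    candidate -- next-hop proposed by the migration (new compute node), or None
    is_ready  -- hypothetical callback: True once the instance is reachable
                 behind the given next-hop
    """
    if candidate is None or candidate == current:
        return current
    # Keep the old route until the new next-hop is confirmed reachable.
    return candidate if is_ready(candidate) else current


# Simulated migration: the destination becomes ready only at the end.
ready_hosts = set()
during = select_advertised_next_hop("10.216.8.51", "10.216.8.79", ready_hosts.__contains__)
ready_hosts.add("10.216.8.79")
after = select_advertised_next_hop("10.216.8.51", "10.216.8.79", ready_hosts.__contains__)
print(during, after)
```

With this rule the advertisement stays on 10.216.8.51 while the migration is in progress and moves to 10.216.8.79 only once the readiness check passes.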

- Actual Behavior
* Neutron BGP advertises the new route too soon, even before the instance has completed migration.
* The system experiences packet loss since traffic is routed to a non-functional instance.

- Logs & Evidence

Ping loss coincides with the BGP route update:

B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27
B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:28
B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:01  <-- New route before instance is ready
B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:02
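
To help pin down when the next-hop switches, the routing-table lines above can be scanned for changes. This is a small sketch that assumes the lines have already been captured as text in the format shown; the regular expression and the function name are illustrative, not part of any Neutron or router tooling.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
# "B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27"
ROUTE_RE = re.compile(
    r"^\s*B\S*\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+\[\d+/\d+\]"
    r"\s+via\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+),"
)


def next_hop_changes(lines):
    """Return (prefix, old_next_hop, new_next_hop) for every next-hop switch."""
    last_seen = {}
    changes = []
    for line in lines:
        match = ROUTE_RE.match(line)
        if match is None:
            continue  # not a BGP route line in the expected format
        prefix, next_hop = match["prefix"], match["next_hop"]
        previous = last_seen.get(prefix)
        if previous is not None and previous != next_hop:
            changes.append((prefix, previous, next_hop))
        last_seen[prefix] = next_hop
    return changes


sample = [
    "B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:28",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:01",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:02",
]
print(next_hop_changes(sample))
```

For the four lines above this reports a single switch, from 10.216.8.51 to 10.216.8.79.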

Ping logs showing packet loss due to early route advertisement:
64 bytes from 10.216.12.36: icmp_seq=4287 ttl=59 time=45.825 ms
64 bytes from 10.216.12.36: icmp_seq=4288 ttl=59 time=48.514 ms
Request timeout for icmp_seq 4296
Request timeout for icmp_seq 4297
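
To line the loss window up against the route change, the ping output can be split into answered and timed-out sequence numbers. This sketch assumes output in the format shown above; the function name is illustrative.

```python
import re

REPLY_RE = re.compile(r"bytes from \S+ icmp_seq=(\d+)")
TIMEOUT_RE = re.compile(r"Request timeout for icmp_seq (\d+)")


def summarize_ping(lines):
    """Return (answered_seqs, timed_out_seqs) parsed from ping output lines."""
    answered, timed_out = [], []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            timed_out.append(int(timeout.group(1)))
            continue
        reply = REPLY_RE.search(line)
        if reply:
            answered.append(int(reply.group(1)))
    return answered, timed_out


ping_log = [
    "64 bytes from 10.216.12.36: icmp_seq=4287 ttl=59 time=45.825 ms",
    "64 bytes from 10.216.12.36: icmp_seq=4288 ttl=59 time=48.514 ms",
    "Request timeout for icmp_seq 4296",
    "Request timeout for icmp_seq 4297",
]
answered, timed_out = summarize_ping(ping_log)
print("answered:", answered, "timed out:", timed_out)
```

Only the sequence numbers present in the log are reported; sequence numbers missing from the excerpt are not inferred.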

- Environment Details

OpenStack Version: Zed (cluster installed via Kolla-Ansible)
OS Version: Ubuntu 22.04.4 LTS Hosts (Kernel: 5.15.0-117-generic)
Neutron Version: 21.1.3.dev24
Services: neutron-server, neutron-dhcp-agent, neutron-openvswitch-agent, neutron-l3-agent, neutron-bgp-dragent, neutron-metadata-agent
Controller & Network Nodes: 5 nodes
Networking Backend: Open vSwitch (DVR mode)
Router HA: Disabled (l3_ha = false)
Neutron Router Setup: Single centralized router connecting all tenant networks to provider network.
BGP Dynamic Routing: neutron-bgp-dragent used to announce unique tenant networks.
Tenant Network Type: VXLAN
External Network Type: VLAN

- Impact
* Severe packet loss and downtime during live migration.
* Production disruptions for applications that require uninterrupted network connectivity.
* ECMP or multipath routing issues due to premature next-hop selection.

- Additional Information
This issue significantly affects high-availability applications relying on seamless instance migrations.

I would appreciate feedback or a possible fix from the Neutron
development team.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2100752

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2100752/+subscriptions