yahoo-eng-team team mailing list archive
Message #95432
[Bug 2100752] [NEW] Severe Packet Loss During Live Migration Due to Premature Route Advertisement in Neutron BGP Dynamic Routing
Public bug reported:
- Summary
Neutron's BGP Dynamic Routing module advertises a new route for a
migrating instance before it is fully operational, leading to severe
packet loss. Traffic is forwarded to an unready next-hop, causing
connectivity disruptions.
- Steps to Reproduce
* Deploy an OpenStack Zed cluster with Neutron BGP Dynamic Routing enabled.
* Create an instance with a large amount of memory (for example, 128 GB) and ensure its address is advertised via BGP.
* Initiate a live migration of the instance to another compute node.
* Monitor BGP route advertisements during the migration process.
* Observe that a new route (with a different next-hop) is advertised before the instance is ready.
* Experience packet loss as traffic is sent to the new, but unready, instance.
- Expected Behavior
* The new BGP route should be advertised only after the instance is fully operational.
* The old route should remain active until the new one is confirmed as reachable.
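The expected behavior amounts to a make-before-break rule: keep announcing the old next-hop until the destination is confirmed ready, then switch. A minimal sketch of that rule follows. It is not Neutron or neutron-dynamic-routing code; `RouteAdvertiser`, `on_migration_event`, and the `destination_ready` flag are hypothetical names used only for illustration, and how readiness would actually be determined is left open.

```python
from dataclasses import dataclass


@dataclass
class RouteAdvertiser:
    """Hypothetical holder for the next-hop currently announced for one prefix."""

    prefix: str
    advertised_next_hop: str

    def on_migration_event(self, candidate_next_hop: str, destination_ready: bool) -> str:
        """Return the next-hop to announce after a migration event.

        The announced next-hop changes only when the destination is reported
        ready; otherwise the existing announcement is left in place.
        """
        if destination_ready and candidate_next_hop != self.advertised_next_hop:
            self.advertised_next_hop = candidate_next_hop
        return self.advertised_next_hop


# Addresses taken from the route log in this report.
advertiser = RouteAdvertiser(prefix="10.216.12.36/32", advertised_next_hop="10.216.8.51")

# Migration in progress: the old next-hop stays announced.
during = advertiser.on_migration_event("10.216.8.79", destination_ready=False)

# Destination confirmed ready: the announcement switches.
after = advertiser.on_migration_event("10.216.8.79", destination_ready=True)

print(during, after)
```

With this rule the upstream router never sees the new next-hop while the instance is still running on the source host.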
- Actual Behavior
* Neutron BGP advertises the new route too soon, even before the instance has completed migration.
* The system experiences packet loss since traffic is routed to a non-functional instance.
- Logs & Evidence
Ping loss and the BGP route update occur at the same time:
B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27
B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:28
B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:01 <-- New route before instance is ready
B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:02
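To pinpoint when the next-hop flips, route lines like the ones above can be scanned programmatically. The sketch below assumes the lines were captured one per poll from the upstream router in the format shown; the regular expression and the function names are illustrative and cover only this line layout.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Matches lines such as:
#   B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27
ROUTE_RE = re.compile(
    r"^B\S*\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+\[\d+/\d+\]\s+"
    r"via\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+),"
)


def parse_route(line: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, next_hop) for a BGP route line, or None if it does not match."""
    match = ROUTE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("prefix"), match.group("next_hop")


def next_hop_changes(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (prefix, old_next_hop, new_next_hop) for every observed change."""
    last_seen: Dict[str, str] = {}
    changes: List[Tuple[str, str, str]] = []
    for line in lines:
        parsed = parse_route(line)
        if parsed is None:
            continue
        prefix, next_hop = parsed
        previous = last_seen.get(prefix)
        if previous is not None and previous != next_hop:
            changes.append((prefix, previous, next_hop))
        last_seen[prefix] = next_hop
    return changes


ROUTE_LINES = [
    "B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:27",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.51, eth2, weight 1, 00:00:28",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:01",
    "B* 10.216.12.36/32 [20/0] via 10.216.8.79, eth2, weight 1, 00:00:02",
]

for prefix, old, new in next_hop_changes(ROUTE_LINES):
    print(f"{prefix}: next-hop changed {old} -> {new}")
```

Timestamping each poll on capture would allow the change to be lined up against ping output and migration events.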
Ping logs showing packet loss due to early route advertisement:
64 bytes from 10.216.12.36: icmp_seq=4287 ttl=59 time=45.825 ms
64 bytes from 10.216.12.36: icmp_seq=4288 ttl=59 time=48.514 ms
Request timeout for icmp_seq 4296
Request timeout for icmp_seq 4297
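Ping output in this format can be summarized with a short script to size the loss window. The sketch below assumes the reply and timeout line formats shown above; the function names are illustrative. It also lists sequence numbers that appear in neither category, which is useful when working from an excerpt rather than a full capture.

```python
import re
from typing import Iterable, List, Tuple

REPLY_RE = re.compile(r"icmp_seq=(\d+)")
TIMEOUT_RE = re.compile(r"Request timeout for icmp_seq (\d+)")


def classify_ping_lines(lines: Iterable[str]) -> Tuple[List[int], List[int]]:
    """Split ping output into sequence numbers that got a reply and those that timed out."""
    replies: List[int] = []
    timeouts: List[int] = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            timeouts.append(int(timeout.group(1)))
            continue
        reply = REPLY_RE.search(line)
        if reply:
            replies.append(int(reply.group(1)))
    return replies, timeouts


def unlisted_sequences(replies: List[int], timeouts: List[int]) -> List[int]:
    """Return sequence numbers in the covered range that appear in neither list."""
    seen = set(replies) | set(timeouts)
    if not seen:
        return []
    return [seq for seq in range(min(seen), max(seen) + 1) if seq not in seen]


PING_LINES = [
    "64 bytes from 10.216.12.36: icmp_seq=4287 ttl=59 time=45.825 ms",
    "64 bytes from 10.216.12.36: icmp_seq=4288 ttl=59 time=48.514 ms",
    "Request timeout for icmp_seq 4296",
    "Request timeout for icmp_seq 4297",
]

replies, timeouts = classify_ping_lines(PING_LINES)
print("last reply:", max(replies))
print("timed out:", timeouts)
print("not present in capture:", unlisted_sequences(replies, timeouts))
```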
- Environment Details
OpenStack Version: Zed (cluster installed via Kolla-Ansible)
OS Version: Ubuntu 22.04.4 LTS Hosts (Kernel: 5.15.0-117-generic)
Neutron Version: 21.1.3.dev24
Services: neutron-server, neutron-dhcp-agent, neutron-openvswitch-agent, neutron-l3-agent, neutron-bgp-dragent, neutron-metadata-agent
Controller & Network Nodes: 5 nodes
Networking Backend: Open vSwitch (DVR mode)
Router HA: Disabled (l3_ha = false)
Neutron Router Setup: Single centralized router connecting all tenant networks to provider network.
BGP Dynamic Routing: neutron-bgp-dragent used to announce unique tenant networks.
Tenant Network Type: VXLAN
External Network Type: VLAN
- Impact
* Severe packet loss and downtime during live migration.
* Production disruptions for applications requiring uninterrupted network connectivity.
* ECMP or multipath routing issues due to premature next-hop selection.
- Additional Information
This issue significantly affects high-availability applications relying on seamless instance migrations.
I would appreciate feedback or a possible fix from the Neutron development
team.
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2100752
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2100752/+subscriptions