
yahoo-eng-team team mailing list archive

[Bug 1715340] [NEW] [RFE] reduce the duration of network interrupt during live migration in DVR scenario

 

Public bug reported:

Nova's live migration process has three stages:
1. pre_live_migration
2. migrating
3. post_live_migration
In the current implementation, Nova plugs a new VIF on the target host during stage 1. The OVS agent on the target host detects this new VIF and tries to bring the port up there, but the port's binding:host_id still points to the source host, so when the agent reports the device to the server over RPC, the server returns nothing.
Nova then performs the actual migration in stage 2; for an instance with a small flavor this can be very short. In stage 3, Nova calls Neutron to update the port's host_id, and this is when the network interruption begins. Throughout the live migration the VM status stays ACTIVE, yet users cannot log in to the VM, and applications running in it go offline for a while. The root cause is that Neutron switches the traffic over too late: by the time Nova has migrated the instance and set it up via libvirt on the target host, the network path provided by Neutron is not yet ready. We therefore need a way to verify that both the L2 and L3 connectivity are ready.
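
For illustration, the stage-3 switchover that opens the interruption window boils down to a port rebinding like the sketch below (a minimal example using python-neutronclient; the endpoint, credentials and names are placeholders, not Nova's actual code):

    # Minimal sketch, not Nova's real implementation: rebind a port to the
    # destination host during post_live_migration.
    from keystoneauth1 import identity, session
    from neutronclient.v2_0 import client

    auth = identity.Password(
        auth_url='http://controller:5000/v3',  # placeholder endpoint
        username='admin', password='secret',   # placeholder credentials
        project_name='admin',
        user_domain_name='Default', project_domain_name='Default')
    neutron = client.Client(session=session.Session(auth=auth))

    def post_live_migration_rebind(port_id, dest_host):
        # Only after this update does Neutron start rewiring the DVR L2/L3
        # paths toward dest_host, which is why the interruption begins here
        # rather than during pre_live_migration.
        neutron.update_port(port_id,
                            {'port': {'binding:host_id': dest_host}})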

We tested this in our production environment, which runs the old Mitaka
release (I believe the same issue exists on master). The length of the
interruption depends on the number of ports in the router's subnets and
on whether the port is associated with a floating IP. With fewer than 20
ports, the interruption lasts up to 8 seconds, plus roughly 5 more
seconds if the port has a floating IP. With more than 20 ports, the
interruption lasts up to 30 seconds, again plus roughly 5 seconds for a
floating IP.

This is unacceptable in NFV scenarios and for some telecommunications
operators. The spec [1] aims to pre-configure the network during live
migration, letting the migration and the network configuration proceed
asynchronously, but it does not solve the key issue: we still need a
mechanism like provisioning blocks so that L2 and L3 are processed in a
synchronized way, plus a way for Nova to learn that the work is done in
Neutron, so that Nova can move on to the next step of the live
migration.
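
To make the proposal concrete, here is a rough sketch of how Neutron's existing provisioning-blocks machinery could be extended with an L3 entity. L3_ENTITY and both call sites are hypothetical; add_provisioning_component, provisioning_complete and L2_AGENT_ENTITY are the existing neutron.db.provisioning_blocks API:

    from neutron.db import provisioning_blocks
    from neutron_lib.callbacks import resources

    L3_ENTITY = 'L3'  # hypothetical new provisioning entity for DVR routing

    def block_port_until_ready(context, port_id):
        # Register both components so the port cannot transition to ACTIVE
        # on the target host until L2 and L3 have each reported readiness.
        provisioning_blocks.add_provisioning_component(
            context, port_id, resources.PORT,
            provisioning_blocks.L2_AGENT_ENTITY)
        provisioning_blocks.add_provisioning_component(
            context, port_id, resources.PORT, L3_ENTITY)

    def dvr_routing_ready(context, port_id):
        # Called once the DVR router plumbing exists on the target host.
        # When the last component completes, provisioning_blocks emits its
        # PROVISIONING_COMPLETE event, which Neutron could relay to Nova
        # (e.g. as a network-vif-plugged external event) so Nova knows it
        # is safe to continue the live migration.
        provisioning_blocks.provisioning_complete(
            context, port_id, resources.PORT, L3_ENTITY)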

[1] http://specs.openstack.org/openstack/neutron-specs/specs/pike/portbinding_information_for_nova.html

** Affects: neutron
     Importance: Undecided
         Status: New


-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1715340


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1715340/+subscriptions

