yahoo-eng-team team mailing list archive
Message #57592
[Bug 1631647] [NEW] Network downtime during live migration through routers
Public bug reported:
neutron/master (close to stable/newton)
VXLAN networks with simple network node (not DVR)
There is network downtime of several seconds during a live migration.
The amount of time depends on when the VM resumes on the target host
versus when the migration ‘completes’.
When a live migration occurs, there is a point in its life cycle where
the VM pauses on the source and starts up (or resumes) on the target. At
that point, the migration isn’t complete; the system has only determined
that it is now best to be running on the target. This of course varies
per hypervisor, but that is the general flow for most hypervisors.
So during the migration the port goes through a few states:
1) Pre migration, it’s tied solely to the source host.
2) During migration, it’s still tied to the source host, but the port profile has a ‘migrating_to’ attribute that identifies the target host.
3) Post migration, the port is tied solely to the target host.
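The three states above can be sketched as a small classifier. This is only an illustration, assuming the port is a plain dict as returned by the Neutron API, with the bound host under `binding:host_id` and the profile under `binding:profile`. The function name `migration_phase` and the host names are hypothetical.

```python
from typing import Any, Dict


def migration_phase(port: Dict[str, Any], source_host: str, target_host: str) -> str:
    """Classify a port dict into one of the three states listed above.

    Hypothetical helper for illustration; it is not part of Neutron or Nova.
    """
    bound_host = port.get("binding:host_id")
    # The profile may be missing or None, so fall back to an empty dict.
    profile = port.get("binding:profile") or {}
    migrating_to = profile.get("migrating_to")

    if bound_host == target_host:
        # State 3: the port is tied solely to the target host.
        return "post-migration"
    if bound_host == source_host and migrating_to == target_host:
        # State 2: still bound to the source, but the target is advertised.
        return "migrating"
    if bound_host == source_host:
        # State 1: bound to the source with no migration in progress.
        return "pre-migration"
    return "unknown"


# Example port in state 2 (host names are made up).
port = {
    "binding:host_id": "compute-src",
    "binding:profile": {"migrating_to": "compute-dst"},
}
print(migration_phase(port, "compute-src", "compute-dst"))
```

In state 2 the bound host alone still points at the source, so anything that only looks at the binding host would keep forwarding traffic there until state 3.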
The OVS agent handles the migration well. It detects the port, sees the UUID, and treats the port properly. But things like the router don’t seem to handle it properly, at least in my testing.
It seems only once the VM hits step 3 (post migration, where nova
updates the port to be on the target host solely) does the routing
information get updated in the router.
In fact, it’s kind of interesting. I’ve been running a constant ping during the live migration through the router and watching it on both sides with tcpdump. When the VM has resumed on the target but the live migration has not yet completed, the following happens:
- Ping request goes out from target server
- Goes out through the router
- Comes back into the router
- Gets sent to the source server
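One way to put a number on the outage seen in a test like this is to look for the longest gap between successful ping replies. The following is a rough sketch, assuming the reply timestamps (in seconds) have already been collected and that pings are sent at a fixed interval. The function name `longest_outage` and the sample values are hypothetical, not measurements from this bug.

```python
from typing import Iterable, Tuple


def longest_outage(reply_times: Iterable[float], interval: float = 1.0) -> Tuple[float, float]:
    """Return (last_good_reply_time, outage_seconds) for the longest gap.

    The outage is the time between two consecutive replies, minus the
    expected ping interval. Returns (0.0, 0.0) if there is no gap.
    """
    times = sorted(reply_times)
    worst_start, worst_gap = 0.0, 0.0
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev - interval
        if gap > worst_gap:
            worst_start, worst_gap = prev, gap
    return worst_start, worst_gap


# Made-up sample: replies stop after t=3.0 and resume at t=8.0.
replies = [0.0, 1.0, 2.0, 3.0, 8.0, 9.0]
start, outage = longest_outage(replies)
print(f"last reply before outage at t={start}, outage of about {outage} s")
```

Comparing the start of the gap with the time the VM resumes on the target, and the end of the gap with the time the port binding moves to the target host, would show whether the outage lines up with the window described above.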
I’m not sure if this is somehow specific to vxlan. I haven’t had a
chance to try Geneve yet.
This could impact projects like Watcher, which will be using live
migration to constantly optimize the system. That optimization could be
undesirable because it would introduce downtime on the workloads being
moved around.
If the time between the VM resuming and the live migration completing is
minimal, then the impact can be quite small (a couple of seconds). If KVM
uses post-copy, it should be susceptible to this.
http://wiki.qemu.org/Features/PostCopyLiveMigration
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1631647