[Bug 1511430] [NEW] live migration does not coordinate VM resume with network readiness
Public bug reported:
When live migrating a VM from one host to another with Neutron, the VM can resume on the destination host while the network is not yet ready (a race condition).
QEMU has a mechanism to send a few RARPs once migration is done and
before resuming.
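For illustration, this is roughly what that announcement looks like on the wire: a minimal scapy sketch of the gratuitous RARP frames captured on TAP30 in the log below (this is a reconstruction of the frame, not QEMU's code, and the interface name is illustrative):

# Sketch only: rebuild the kind of RARP announcement QEMU emits after
# migration, matching the frames captured on TAP30 in the log below.
from scapy.all import ARP, Ether, sendp

vm_mac = "fa:16:3e:50:a3:46"   # the migrated instance's MAC, from the log

# Ethertype 0x8035 is Reverse ARP; op=3 is "reverse request who-is <mac>".
rarp = (Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff", type=0x8035) /
        ARP(op=3, hwsrc=vm_mac, hwdst=vm_mac, psrc="0.0.0.0", pdst="0.0.0.0"))

# If the destination port is not yet wired/tagged when this goes out,
# the switches learn nothing useful and the frame is effectively lost.
# sendp(rarp, iface="qvo2e6d0f35-cb")   # interface name taken from the log

tcpdump decodes such a frame exactly as the "Reverse Request who-is ... tell ..." lines in the trace below.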
Nova needs to coordinate with QEMU and Neutron (via the nova/neutron notification mechanism) to make sure the VM is only resumed on the destination host once networking has been properly wired; otherwise the RARPs are lost, and connectivity to the VM is disrupted until the VM happens to send a broadcast frame of its own.
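As a rough sketch of the kind of coordination being proposed (the class, event flow and timeout below are hypothetical, not existing Nova or Neutron APIs), the destination compute host would hold the resume until Neutron reports the instance's ports as wired:

# Hypothetical sketch of the proposed nova/neutron coordination; names,
# flow and timeout are assumptions, not existing Nova/Neutron code.
import threading

class NetworkReadyWaiter:
    """Tracks 'port wired' notifications for a migrating instance."""

    def __init__(self, expected_port_ids):
        self._pending = set(expected_port_ids)
        self._ready = threading.Event()

    def notify_wired(self, port_id):
        # Called when Neutron reports the port as wired on the
        # destination host (e.g. the OVS tag has been set).
        self._pending.discard(port_id)
        if not self._pending:
            self._ready.set()

    def wait(self, timeout=300):
        # The destination host would call this before letting QEMU
        # resume, so the RARPs are not sent into an unwired port.
        return self._ready.wait(timeout)

# Usage sketch:
#   waiter = NetworkReadyWaiter({"<port-uuid>"})
#   ... Neutron notification handler calls waiter.notify_wired(port_id) ...
#   if not waiter.wait(timeout=300):
#       LOG.warning("networking not wired in time, resuming anyway")
#   resume_instance_on_destination()   # hypothetical helper

Whether to resume anyway or abort the migration when the wait times out is a policy decision; the sketch only shows where the wait would sit.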
Log detail (merged from both hosts' logs and tcpdumps)
Migration from host 29 to host 30:
2015-10-29 10:54:27.592000 [VMLIFE30] 21476 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Resumed (Lifecycle Event)
2015-10-29 10:54:27.609000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Paused (Lifecycle Event)
2015-10-29 10:54:27.636000 [TAP30] tcpdump DEBUG 10:54:27.632047 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:27.656000 [TAP29] tcpdump DEBUG tcpdump: pcap_loop: The interface went down
2015-10-29 10:54:27.787000 [TAP30] tcpdump DEBUG 10:54:27.783353 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:27.818000 [FDB30] ovs-fdb DEBUG 62 0 fa:16:3e:50:a3:46 0 # the MAC is learned on VLAN 0 when it should be "1": the port is still not tagged, and the entry is also not propagated to other hosts because VLAN 0 is invalid in the OVS implementation
2015-10-29 10:54:28.037000 [TAP30] tcpdump DEBUG 10:54:28.033259 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:28.387000 [TAP30] tcpdump DEBUG 10:54:28.383211 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:28.969000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Stopped (Lifecycle Event)
2015-10-29 10:54:29.803000 [OVS30] 21310 DEBUG neutron.agent.linux.utils [req-a33468a6-f259-4324-a132-ab0dd025eeec None]
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qvo2e6d0f35-cb', 'tag=1'] # wiring is now ready, and after this neutron-openvswitch-agent will notify neutron-server which could notify nova about readiness...
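For reference, the effect of that wiring step can be checked from the destination host by reading the tag back from OVS; a minimal diagnostic sketch follows (the port name is taken from the log, the polling interval and timeout are assumptions, and this is not part of Nova or Neutron):

# Diagnostic sketch: poll OVS until the port carries a real VLAN tag,
# i.e. until the 'ovs-vsctl set Port ... tag=1' call above has taken
# effect. Needs privileges to run ovs-vsctl.
import subprocess
import time

def wait_for_port_tag(port="qvo2e6d0f35-cb", timeout=60, interval=0.5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.run(
            ["ovs-vsctl", "get", "Port", port, "tag"],
            capture_output=True, text=True,
        ).stdout.strip()
        # Before wiring, the tag is unset ('[]') and traffic shows up on
        # the invalid VLAN 0 seen in the FDB dump above.
        if out and out != "[]":
            return int(out)
        time.sleep(interval)
    return None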
A reproduction Ansible playbook is provided to show how it happens:
https://github.com/mangelajo/oslogmerger/blob/master/contrib/debug-live-migration/debug-live-migration.yaml
The complete merged output produced with oslogmerger can be found here:
https://raw.githubusercontent.com/mangelajo/oslogmerger/master/contrib/debug-live-migration/logs/mergedlogs-packets-ovs.log
** Affects: nova
Importance: Undecided
Status: Confirmed
** Changed in: nova
Status: New => Confirmed
https://bugs.launchpad.net/bugs/1511430