[Bug 1511430] [NEW] live migration does not coordinate VM resume with network readiness

 

Public bug reported:

When live-migrating a VM from one host to another with Neutron, the VM can
resume on the destination host while the network is not yet ready (a race
condition).

QEMU has a mechanism to send a few RARP frames once migration is done,
just before the VM resumes on the destination.
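
For illustration, the RARP frames seen in the capture below could be
reproduced with scapy; the interface name here is a placeholder:

    # Illustrative only: a RARP request like the ones QEMU emits after
    # migration (ethertype 0x8035, op 3 = "reverse request").
    from scapy.all import ARP, Ether, sendp

    vm_mac = "fa:16:3e:50:a3:46"  # instance MAC, taken from the logs below
    frame = (
        Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff", type=0x8035)
        / ARP(op=3, hwsrc=vm_mac, hwdst=vm_mac, psrc="0.0.0.0", pdst="0.0.0.0")
    )
    sendp(frame, iface="tap-placeholder")  # placeholder interface name

tcpdump prints such a frame as "Reverse Request who-is <mac> tell <mac>",
matching the [TAP30] lines below.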

Nova needs to coordinate with QEMU and Neutron (via the Nova/Neutron
notification mechanism) to make sure the VM is only resumed on the
destination host once networking has been properly wired; otherwise the
RARPs are lost, and connectivity to the VM is disrupted until the VM
itself sends a broadcast frame.
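
A minimal sketch of the missing coordination on the Nova side, assuming a
hypothetical wait_for_vif_plugged helper (the real plumbing would go
through Nova's compute manager and event machinery):

    # Sketch only: gate the resume on the destination host on a
    # "network-vif-plugged"-style event from Neutron. The helper names
    # here are hypothetical, not Nova's actual API.
    import eventlet

    NETWORK_WIRING_TIMEOUT = 300  # seconds; illustrative value

    def resume_when_wired(instance, wait_for_vif_plugged, resume_vm):
        # Block until Neutron signals that the destination port is
        # tagged/wired; eventlet.Timeout raises if it never arrives.
        with eventlet.Timeout(NETWORK_WIRING_TIMEOUT):
            wait_for_vif_plugged(instance)
        # Only now let QEMU resume the guest, so the RARPs it sends
        # land on a properly wired port.
        resume_vm(instance)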

Log detail (merged from both hosts' logs and tcpdumps):

Migration from host 29 to host 30:

2015-10-29 10:54:27.592000 [VMLIFE30] 21476 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Resumed (Lifecycle Event)
2015-10-29 10:54:27.609000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Paused (Lifecycle Event)
2015-10-29 10:54:27.636000 [TAP30] tcpdump DEBUG 10:54:27.632047 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46
2015-10-29 10:54:27.656000 [TAP29] tcpdump DEBUG tcpdump: pcap_loop: The interface went down

2015-10-29 10:54:27.787000 [TAP30] tcpdump DEBUG 10:54:27.783353 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:27.818000 [FDB30] ovs-fdb DEBUG 62     0    fa:16:3e:50:a3:46    0  # MAC learned on VLAN 0, should be "1": the port is still not tagged, and the frames are not propagated to other hosts because VLAN 0 is invalid in the OVS implementation
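
The [FDB30] entry above can be checked directly on the destination host
with ovs-appctl; a small polling helper along these lines (the bridge name
br-int is an assumption) makes the race visible:

    # Illustrative helper: poll the OVS forwarding database until the
    # instance MAC is learned on the expected VLAN. "ovs-appctl fdb/show"
    # prints "port VLAN MAC Age" rows like the [FDB30] entry above.
    import subprocess
    import time

    def wait_for_fdb_vlan(mac, vlan, bridge="br-int", timeout=10.0):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            out = subprocess.check_output(
                ["ovs-appctl", "fdb/show", bridge], text=True)
            for row in out.splitlines()[1:]:  # skip the header row
                fields = row.split()
                if len(fields) >= 3 and fields[2] == mac and fields[1] == str(vlan):
                    return True
            time.sleep(0.5)
        return False

    # e.g. wait_for_fdb_vlan("fa:16:3e:50:a3:46", 1)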

2015-10-29 10:54:28.037000 [TAP30] tcpdump DEBUG 10:54:28.033259 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:28.387000 [TAP30] tcpdump DEBUG 10:54:28.383211 fa:16:3e:50:a3:46 > Broadcast, ethertype Reverse ARP (0x8035), length 60: Reverse Request who-is fa:16:3e:50:a3:46 tell fa:16:3e:50:a3:46, length 46

2015-10-29 10:54:28.969000 [VMLIFE29] 29022 INFO nova.compute.manager [-] [instance: a18a5824-4215-4e24-bcfc-cb9f89f6bcbd] VM Stopped (Lifecycle Event)

2015-10-29 10:54:29.803000 [OVS30] 21310 DEBUG neutron.agent.linux.utils [req-a33468a6-f259-4324-a132-ab0dd025eeec None]
                                        Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qvo2e6d0f35-cb', 'tag=1']  # wiring is now ready; after this, neutron-openvswitch-agent notifies neutron-server, which could in turn notify Nova about readiness...
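
Neutron already has a channel for that last step at boot time: the Nova
os-server-external-events API, to which it posts "network-vif-plugged"
events. A hedged sketch of the same notification once the port is tagged
(endpoint, token and UUIDs are placeholders; real Neutron goes through its
Nova notifier with proper Keystone auth):

    # Sketch only: the event Neutron could send Nova once the port is
    # wired on the destination host.
    import requests

    NOVA_ENDPOINT = "http://nova-api:8774/v2.1"  # placeholder
    TOKEN = "..."  # placeholder Keystone token

    def notify_vif_plugged(server_uuid, port_uuid):
        body = {"events": [{
            "name": "network-vif-plugged",
            "server_uuid": server_uuid,
            "tag": port_uuid,  # Nova matches this against the VIF id
        }]}
        resp = requests.post(
            NOVA_ENDPOINT + "/os-server-external-events",
            json=body,
            headers={"X-Auth-Token": TOKEN},
        )
        resp.raise_for_status()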


An Ansible reproduction script shows how this happens:

https://github.com/mangelajo/oslogmerger/blob/master/contrib/debug-live-migration/debug-live-migration.yaml

The complete merged output from oslogmerger can be found here:
https://raw.githubusercontent.com/mangelajo/oslogmerger/master/contrib/debug-live-migration/logs/mergedlogs-packets-ovs.log

** Affects: nova
     Importance: Undecided
         Status: Confirmed

** Changed in: nova
       Status: New => Confirmed

https://bugs.launchpad.net/bugs/1511430
