yahoo-eng-team team mailing list archive

[Bug 1905944] [NEW] live-migration job not aborted when live_monitor thread fails

 

Public bug reported:

Description
===========

During live migration, a monitoring thread polls libvirt job progress every
0.5s and updates the DB with the job stats. If a control plane issue occurs,
such as a DB/RPC failure or an unexpected libvirt exception (timeout), the
exception handling does not properly interrupt the libvirt job.

Steps to reproduce
==================
On a multinode devstack master.

# spawn an instance on source_host
1) openstack server create --flavor m1.small --image cirros-0.5.1-x86_64-disk \
--nic net-id=private inst

# start a live block migration to dest_host, wait a bit (to be in the monitoring thread),
# and trigger an issue on the DB, for example.
2) nova live-migration inst ; sleep 6 ; sudo service mysql restart

3) On the source host you can watch the libvirt job progress until it completes and disappears,
because libvirt resumes the guest on the target host (which starts writing data to the target disk)
source_host$ watch -n 1 virsh domjobinfo instance-0000000d

4) On the dest host you will find the instance active
dest_host$ virsh list
 Id   Name                State
-----------------------------------
 20   instance-0000000d   running

5) nova show inst shows the instance still on the source host.
$ nova show inst | grep host
| OS-EXT-SRV-ATTR:host                 | source_host


If an admin tries to recover the instance on the source host, since that is where the nova DB places it,
we can end up in a split-brain where 2 qemu processes run on two different disks on two hosts
(true story..)
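
Before attempting any recovery, the mismatch shown in steps 4 and 5 can be checked for. The following is only a rough sketch: the function name and the two input dicts are hypothetical, and collecting the data (for example from `nova show` and `virsh list` on each host) is left out. It also assumes every listed instance is expected to be running.

```python
def find_misplaced_instances(nova_hosts, running_domains):
    """Return instances whose running location differs from the Nova DB.

    nova_hosts:      {domain_name: host recorded in the Nova DB}
    running_domains: {host: set of domain names running on that host}
    Both inputs are hypothetical; gathering them is out of scope here.
    """
    problems = {}
    for name, recorded_host in nova_hosts.items():
        # Every host on which a domain with this name is actually running.
        actual = sorted(
            host for host, domains in running_domains.items() if name in domains
        )
        # Consistent only if it runs on exactly the recorded host.
        if actual != [recorded_host]:
            problems[name] = actual
    return problems


if __name__ == "__main__":
    nova_hosts = {"instance-0000000d": "source_host"}
    running = {"source_host": set(), "dest_host": {"instance-0000000d"}}
    print(find_misplaced_instances(nova_hosts, running))
```

An instance reported on a host other than the recorded one, or on two hosts at once, should not be restarted on the source before the situation is understood.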

Expected result
===============
If an issue happens, we must at least ensure that the libvirt job is interrupted, preventing
the guest from resuming on the target host.
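
The expected behaviour could look roughly like the sketch below. This is not nova's actual code: `FakeDomain`, `monitor_migration` and `save_stats` are hypothetical names used for illustration, and the fake domain only stands in for a libvirt domain (in libvirt-python the abort call would be `virDomain.abortJob()`).

```python
import time


class FakeDomain:
    """Hypothetical stand-in for a libvirt domain with a migration job."""

    def __init__(self, polls_before_done):
        self.remaining = polls_before_done
        self.aborted = False

    def job_running(self):
        # A real monitor would query the libvirt job info here.
        self.remaining -= 1
        return self.remaining > 0 and not self.aborted

    def abort_job(self):
        # With libvirt-python this would be virDomain.abortJob().
        self.aborted = True


def monitor_migration(domain, save_stats, interval=0.5):
    """Poll the job until it ends; abort it if monitoring itself fails."""
    try:
        while domain.job_running():
            save_stats()  # e.g. a DB/RPC update, which may raise
            time.sleep(interval)
    except Exception:
        # Do not let the job finish unattended: interrupt it so the
        # guest is not resumed on the target host, then re-raise.
        domain.abort_job()
        raise
    return True


if __name__ == "__main__":
    def failing_save():
        raise RuntimeError("DB unavailable")

    dom = FakeDomain(polls_before_done=10)
    try:
        monitor_migration(dom, failing_save, interval=0)
    except RuntimeError:
        pass
    print("job aborted:", dom.aborted)
```

The point of the sketch is only that any unexpected exception in the polling loop leads to an abort of the job before the error propagates.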

Actual result
=============
If an issue happens, the libvirt job continues and brings up the guest on the target host,
while nova still considers it on the source.

** Affects: nova
     Importance: Undecided
     Assignee: Alexandre arents (aarents)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => Alexandre arents (aarents)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1905944

Title:
  live-migration job not aborted when live_monitor thread fails

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1905944/+subscriptions

