
yahoo-eng-team team mailing list archive

[Bug 1573875] Re: Nova able to start VM 2 times after failed to live migrate it

 

OK, after re-reading the bug description, I think upstream Nova should
take a look at this and address it:

> - If nova able to start instances two times with same rbd block
> device, it's a really big hole in the system I think [...]

I've changed the description of the bug report to make that clear.

** Changed in: nova
       Status: Invalid => New

** Summary changed:

- Nova able to start VM 2 times after failed to live migrate it
+ The same ceph rbd device is used by multiple instances

** Tags added: ceph libvirt live-migration

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1573875

Title:
  The same ceph rbd device is used by multiple instances

Status in OpenStack Compute (nova):
  New

Bug description:
  Hi,

  I've faced a strange problem with nova.
  A few environmental details:
   - We use Ubuntu 14.04 LTS
   - We use Kilo from the Ubuntu cloud archive
   - We use KVM as the hypervisor with the stock qemu 2.2
   - We have Ceph as shared storage with libvirt rbd devices
   - OVS-based Neutron networking, but I think it's all the same with other solutions.

  So, the workflow needed to reproduce the bug:
   - Start a Windows guest (Linux distros are not affected, as far as I saw)
   - Live migrate this VM to another host (okay, I know it doesn't fit 100% with the cloud philosophy, but we must use it); a sketch of how we trigger this follows below
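
  For reference, a minimal sketch of how the second step can be triggered from
  python-novaclient (the credentials, instance name and destination host are
  placeholders):

    # Sketch only: trigger the live migration with python-novaclient (Kilo-era API)
    from novaclient import client

    nova = client.Client('2', 'user', 'password', 'project',
                         'http://keystone:5000/v2.0')
    server = nova.servers.find(name='windows-guest')   # placeholder instance name
    server.live_migrate(host='compute-02',             # placeholder destination host
                        block_migration=False,         # shared Ceph storage, no block migration
                        disk_over_commit=False)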

  What happened then is really wrong behavior:
   - The VM starts to migrate (virsh list shows it on the new host)
   - On the source side, virsh list tells me the instance is stopped
   - After a few seconds, the destination host just removes the instance, and the source changes its state back to running
   - The network becomes unavailable
   - Horizon reports that the instance is in shut off state, and it definitely is not (the VNC is still available, for example)
   - The user can click on the 'Start instance' button, and the instance will be started at the destination
   - We see lines like this in the instance's libvirt log: "qemu-system-x86_64: load of migration failed: Invalid argument"

  After a few Google searches with this error, I found this report: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1472500
  It's not the exact same error, but it tells us a really important fact: these errors appeared with qemu 2.2 and were fixed in 2.3...

  First of all, I installed two CentOS compute nodes, which come with
  qemu 2.3 by default, and Windows migration started to work there just
  as Linux guests did before.

  Unfortunately, we must use Ubuntu, so we needed to find a workaround,
  which was done yesterday...

  What I did:
   - Added the Mitaka repository (which came out two days before)
   - Ran this command (I cannot dist-upgrade OpenStack now): apt-get install qemu-system qemu-system-arm qemu-system-common qemu-system-mips qemu-system-misc qemu-system-ppc qemu-system-sparc qemu-system-x86 qemu-utils seabios libvirt-bin
   - Left qemu 2.5 installed
   - The migration tests showed us that these new packages solve the issue

  What I want/advise to repair this:
   - First of all, it would be nice to be able to install qemu 2.5 from the original Kilo repository, so that I can upgrade without any 'quick and dirty' method (adding and removing the Mitaka repo just to install qemu). It is ASAP for us, because if we don't get this by next weekend, I'll have to choose the quick and dirty way (but I don't want to rush anybody... just telling :) )

   - If nova is able to start instances twice with the same rbd block
  device, it's a really big hole in the system I think... we just
  corrupted two test Windows 7 guests with a few clicks... Some safety
  check should be implemented which collects the instances (and their
  states) from KVM at every VM start, and if the algorithm sees that a
  guest is already running with the same name (or some kind of UUID
  maybe), it just does not start another copy... (a rough sketch of what
  I mean follows after this list)

   - Some kind of check would also be useful which automatically
  compares the VM states in the database with the states on the
  hypervisor side at a given interval (this check could be disabled,
  and the checking interval should be configurable imho); also sketched
  below
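
  To illustrate the first idea, here is a rough sketch of the kind of pre-start
  guard I mean, using the libvirt Python bindings (the function and its names
  are only illustrative, not actual nova code):

    # Hypothetical guard: before powering on an instance, ask a hypervisor
    # whether a domain with the same name is already defined and running.
    import libvirt

    def is_already_running(instance_name, host_uri='qemu:///system'):
        conn = libvirt.open(host_uri)
        try:
            for dom in conn.listAllDomains():
                if dom.name() == instance_name and dom.isActive():
                    return True
            return False
        finally:
            conn.close()

  Of course, asking only one host is not enough to catch this case; the check
  would have to look at the migration source (or every hypervisor), since the
  duplicate copy was still running on the source node.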
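
  And for the second idea, the kind of reconciliation loop I mean (db_api and
  hypervisor are hypothetical helpers standing in for the nova database and
  the libvirt driver, nothing more):

    # Hypothetical periodic check: compare the power state recorded in the
    # database with what the hypervisor actually reports, and flag mismatches.
    import time

    def reconcile_power_states(db_api, hypervisor, interval=600):
        while True:
            for inst in db_api.list_instances():                # nova's view
                actual = hypervisor.get_power_state(inst.uuid)  # KVM's view
                if actual != inst.power_state:
                    print('state mismatch for %s: db=%s, hypervisor=%s'
                          % (inst.uuid, inst.power_state, actual))
            time.sleep(interval)                                # configurable interval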

  I've not found any clue that these things on the nova side were fixed
  previously in Liberty or Mitaka... am I right, or did something escape
  my attention?

  If any further information is needed, feel free to ask :)

  Regards, 
   Peter

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1573875/+subscriptions

