yahoo-eng-team team mailing list archive

[Bug 1781878] [NEW] VM fails to boot after evacuation when it uses ceph disk

 

Public bug reported:

Description
===========
If Ceph RBD is used as the storage backend and the Ceph disks (images) have the exclusive-lock feature enabled, the evacuation itself works fine when a compute node goes down: Nova detects that the VM's disk is on shared storage and rebuilds the VM on another node. But after the evacuation, although Nova marks the instance as active, the instance fails to boot and hits a kernel panic because the guest kernel cannot write to its disk.

It is possible to disable the exclusive-lock feature on Ceph, and the
evacuation then works fine, but the feature needs to stay enabled in
some use cases.
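
For reference, this is roughly how the feature can be checked (and dynamically disabled) with the rbd Python bindings; the pool name 'vms', the image name and the ceph.conf path below are placeholders for my setup, not anything Nova-specific:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('vms')                    # placeholder pool name
    image = rbd.Image(ioctx, '<instance-uuid>_disk')     # placeholder image name

    # features() returns a bitmask of the enabled image features
    enabled = bool(image.features() & rbd.RBD_FEATURE_EXCLUSIVE_LOCK)
    print('exclusive-lock enabled: %s' % enabled)

    # exclusive-lock can also be disabled on a live image (dependent features
    # such as object-map or journaling must be disabled first if enabled):
    # image.update_features(rbd.RBD_FEATURE_EXCLUSIVE_LOCK, False)

    image.close()
    ioctx.close()
    cluster.shutdown()

The CLI equivalent is "rbd info" and "rbd feature disable <image> exclusive-lock".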

There is also a workaround for this problem: we were able to evacuate
an instance successfully by removing the old instance's lock on the
disk with the rbd command line, but I think this should be done in the
code of the RBD driver in Nova and Cinder.
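
A minimal sketch of that workaround with the rbd Python bindings (the equivalent of "rbd lock list" followed by "rbd lock remove"); again, the pool and image names are placeholders:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('vms')                    # placeholder pool name
    image = rbd.Image(ioctx, '<instance-uuid>_disk')     # placeholder image name

    # list_lockers() returns {} or a dict with 'tag', 'exclusive' and
    # 'lockers', the latter being a list of (client, cookie, address) tuples
    lockers = image.list_lockers()
    for client, cookie, addr in lockers.get('lockers', []):
        print('breaking lock held by %s (%s)' % (client, addr))
        image.break_lock(client, cookie)

    image.close()
    ioctx.close()
    cluster.shutdown()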

The problem seems to be with the exclusive-lock feature. When a disk has
exclusive-lock enabled, Ceph locks the disk for a client (the VM) as soon
as that client connects and writes to the disk (lock-on-write); if
lock-on-read is enabled in the Ceph configuration, the disk is locked on
the first read instead. Since the evacuation process has no defined step
for removing the exclusive lock held by the old VM, the new VM cannot
acquire the lock and its writes to the disk fail.

I found a similar problem reported for Kubernetes, where a node goes down and the system tries to attach its volume to a new Pod:
https://github.com/openshift/origin/issues/7983#issuecomment-243736437
There, people proposed that before bringing up the new instance, the old client should first be blacklisted, then the disk unlocked and locked again for the new one.
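
Translated to our case, that proposal would look roughly like the sketch below (same placeholders as above; the mon command format is my assumption of how the blacklisting could be issued from Python, not something Nova does today):

    import json

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('vms')                    # placeholder pool name
    image = rbd.Image(ioctx, '<instance-uuid>_disk')     # placeholder image name

    for client, cookie, addr in image.list_lockers().get('lockers', []):
        # 1) blacklist the old client so it can no longer write to the image
        cmd = json.dumps({'prefix': 'osd blacklist',
                          'blacklistop': 'add', 'addr': addr})
        ret, _, outs = cluster.mon_command(cmd, b'')
        if ret != 0:
            raise RuntimeError('blacklisting %s failed: %s' % (addr, outs))
        # 2) break its lock so the evacuated VM can acquire it
        image.break_lock(client, cookie)

    image.close()
    ioctx.close()
    cluster.shutdown()

The CLI equivalent of the first step is "ceph osd blacklist add <addr>".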

Steps to reproduce
==================
* Create an instance (with the Ceph storage backend) and wait for it to boot
* Power off the host of the instance
* Evacuate the instance
* Check the console in the dashboard (an API-driven version of these steps is sketched below)
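
For completeness, the evacuation and console check can also be driven from the API with python-novaclient (the credentials, auth URL, server name and microversion below are placeholders; powering off the compute host itself has to be done out of band, e.g. via IPMI):

    from keystoneauth1 import loading, session
    from novaclient import client as nova_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(auth_url='http://controller:5000/v3',  # placeholder
                                    username='admin', password='secret',
                                    project_name='admin',
                                    user_domain_name='Default',
                                    project_domain_name='Default')
    nova = nova_client.Client('2.29', session=session.Session(auth=auth))

    # instance booted beforehand on the Ceph-backed compute nodes
    server = nova.servers.find(name='evac-test')

    # after powering off the instance's compute host (out of band), evacuate
    # it; with no target host given, the scheduler picks the destination node
    nova.servers.evacuate(server)

    # once the instance is ACTIVE again, the console log shows the kernel panic
    print(nova.servers.get_console_output(server, length=50))

The CLI equivalent of the evacuation step is simply "nova evacuate <server>".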

Expected result
===============
The instance should boot without any problem.

Actual result
=============
The instance encounters a kernel panic and fails to boot.

Environment
===========
1. OpenStack Queens, Nova 17.0.2
2. Hypervisor: Libvirt (v4.0.0) + KVM
3. Storage: Ceph 12.2.4

Logs & Configs
==============
Console log of the instance after its evacuation:

[    2.352586] blk_update_request: I/O error, dev vda, sector 18436
[    2.357199] Buffer I/O error on dev vda1, logical block 2, lost async page write
[    2.363736] blk_update_request: I/O error, dev vda, sector 18702
[    2.431927] Buffer I/O error on dev vda1, logical block 135, lost async page write
[    2.442673] blk_update_request: I/O error, dev vda, sector 18708
[    2.449862] Buffer I/O error on dev vda1, logical block 138, lost async page write
[    2.460061] blk_update_request: I/O error, dev vda, sector 18718
[    2.468022] Buffer I/O error on dev vda1, logical block 143, lost async page write
[    2.477360] blk_update_request: I/O error, dev vda, sector 18722
[    2.484106] Buffer I/O error on dev vda1, logical block 145, lost async page write
[    2.493227] blk_update_request: I/O error, dev vda, sector 18744
[    2.499642] Buffer I/O error on dev vda1, logical block 156, lost async page write
[    2.505792] blk_update_request: I/O error, dev vda, sector 35082
[    2.510281] Buffer I/O error on dev vda1, logical block 8325, lost async page write
[    2.516296] Buffer I/O error on dev vda1, logical block 8326, lost async page write
[    2.522749] blk_update_request: I/O error, dev vda, sector 35096
[    2.527483] Buffer I/O error on dev vda1, logical block 8332, lost async page write
[    2.533616] Buffer I/O error on dev vda1, logical block 8333, lost async page write
[    2.540085] blk_update_request: I/O error, dev vda, sector 35104
[    2.545149] blk_update_request: I/O error, dev vda, sector 36236
[    2.549948] JBD2: recovery failed
[    2.552989] EXT4-fs (vda1): error loading journal
[    2.557228] VFS: Dirty inode writeback failed for block device vda1 (err=-5).
[    2.563139] EXT4-fs (vda1): couldn't mount as ext2 due to feature incompatibilities
[    2.704190] JBD2: recovery failed
[    2.708709] EXT4-fs (vda1): error loading journal
[    2.714963] VFS: Dirty inode writeback failed for block device vda1 (err=-5).
mount: mounting /dev/vda1 on /newroot failed: Invalid argument
umount: can't umount /dev/vda1: Invalid argument
mcb [info=LABEL=cirros-rootfs dev=/dev/vda1 target=/newroot unmount=cbfail callback=check_sbin_init ret=1: failed to unmount
[    2.886773] JBD2: recovery failed
[    2.892670] EXT4-fs (vda1): error loading journal
[    2.900580] VFS: Dirty inode writeback failed for block device vda1 (err=-5).
[    2.911330] EXT4-fs (vda1): couldn't mount as ext2 due to feature incompatibilities
[    3.044295] JBD2: recovery failed
[    3.050363] EXT4-fs (vda1): error loading journal
[    3.058689] VFS: Dirty inode writeback failed for block device vda1 (err=-5).
mount: mounting /dev/vda1 on /newroot failed: Invalid argument
info: copying initramfs to /dev/vda1
mount: can't find /newroot in /proc/mounts
info: initramfs loading root from /dev/vda1
BusyBox v1.23.2 (2017-11-20 02:37:12 UTC) multi-call binary.

Usage: switch_root [-c /dev/console] NEW_ROOT NEW_INIT [ARGS]

Free initramfs and switch to another root fs:
chroot to NEW_ROOT, delete all in /, move NEW_ROOT to /,
execute NEW_INIT. PID must be 1. NEW_ROOT must be a mountpoint.

 -c DEV Reopen stdio to DEV after switch

[    3.170388] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[    3.170388]
[    3.186305] CPU: 0 PID: 1 Comm: switch_root Not tainted 4.4.0-28-generic #47-Ubuntu
[    3.198826] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
[    3.213538]  0000000000000086 000000004cbc7242 ffff88001f63be10 ffffffff813eb1a3
[    3.227588]  ffffffff81cb10d8 ffff88001f63bea8 ffff88001f63be98 ffffffff8118bf57
[    3.241405]  ffff880000000010 ffff88001f63bea8 ffff88001f63be40 000000004cbc7242
[    3.251820] Call Trace:
[    3.254191]  [<ffffffff813eb1a3>] dump_stack+0x63/0x90
[    3.258257]  [<ffffffff8118bf57>] panic+0xd3/0x215
[    3.261865]  [<ffffffff81184e1e>] ? perf_event_exit_task+0xbe/0x350
[    3.266173]  [<ffffffff81084541>] do_exit+0xae1/0xaf0
[    3.269989]  [<ffffffff8106b554>] ? __do_page_fault+0x1b4/0x400
[    3.274408]  [<ffffffff810845d3>] do_group_exit+0x43/0xb0
[    3.278557]  [<ffffffff81084654>] SyS_exit_group+0x14/0x20
[    3.282693]  [<ffffffff818276b2>] entry_SYSCALL_64_fastpath+0x16/0x71
[    3.290709] Kernel Offset: disabled
[    3.293770] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[    3.293770]

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1781878
