
yahoo-eng-team team mailing list archive

[Bug 1821696] [NEW] Failed to start instances with encrypted volumes

 

Public bug reported:

Description
===========
We hit this bug after a complete cluster shutdown for server room maintenance. It can, however, be reproduced more easily (see the note under Steps to reproduce).

When cold starting an instance with an encrypted volume attached, it
fails to start with a VolumeEncryptionNotSupported error.

https://github.com/openstack/os-brick/blob/stable/rocky/os_brick/encryptors/cryptsetup.py#L52

Steps to reproduce
==================

* Deploy OpenStack with Barbican support using Kolla.
* Create an encrypted volume type.
* Create an encrypted volume.
* Create an instance and attach the encrypted volume.
* Enjoy your new instance and volume, install software and store data.
* In our case, we shut down the entire cluster and restarted it. First, all instances were stopped in Horizon using the Shut Down Instance command. We use Ceph, so we then stopped it following the procedure at https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/ and then shut down the compute / storage nodes, followed by the controller nodes one by one. We then started the cluster in the reverse order, verified that all services were up and running, examined the logs, and started the instances. Instances without encrypted volumes started fine.
* Instances with encrypted volumes fail to start with VolumeEncryptionNotSupported.

Note: It is possible to recreate the problem by using a Hard Reboot
(possibly related to https://bugs.launchpad.net/nova/+bug/1597234) or by
shutting down instances and then restarting all OpenStack services on
that compute node.
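
For anyone who wants to try the shorter Hard Reboot path from the command line, the steps might look roughly like the sketch below. This is an assumption-laden outline, not what we ran: we used Horizon, and the server name, volume name, volume type name (`LUKS`), cipher, key size and volume size here are all placeholder values. The helper `reproduction_commands` is hypothetical and only builds the `openstack` CLI invocations; it does not execute anything.

```python
import shlex


def reproduction_commands(server, volume, volume_type="LUKS"):
    """Return the CLI steps as argv lists; nothing is executed here.

    All names and encryption parameters are placeholders (assumptions),
    not values taken from the affected deployment.
    """
    return [
        # Create an encrypted volume type.
        ["openstack", "volume", "type", "create",
         "--encryption-provider", "luks",
         "--encryption-cipher", "aes-xts-plain64",
         "--encryption-key-size", "256",
         "--encryption-control-location", "front-end",
         volume_type],
        # Create an encrypted volume of that type.
        ["openstack", "volume", "create",
         "--type", volume_type, "--size", "1", volume],
        # Attach it to an existing instance.
        ["openstack", "server", "add", "volume", server, volume],
        # Hard reboot the instance, which is where the failure shows up.
        ["openstack", "server", "reboot", "--hard", server],
    ]


if __name__ == "__main__":
    for argv in reproduction_commands("test-server", "test-volume"):
        print(shlex.join(argv))
```

Printing the commands instead of running them keeps the sketch safe to paste; review each line against your own deployment before executing it.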

Expected results
================
Instances with encrypted volumes should start fine, even after a Hard Reboot or a complete cluster shutdown.

Actual results
==============
Instances with encrypted volumes failed to start with VolumeEncryptionNotSupported.

https://pastebin.com/mvMbJQRb

Environment
===========

1. OpenStack version
Environment is established by Kolla (Rocky release).

2. Hypervisor
KVM on RHEL

3. Storage type
Ceph using Kolla (Rocky release)

Analysis
========
There seems to be a problem related to this code not behaving as expected:

https://github.com/openstack/nova/blob/stable/rocky/nova/virt/libvirt/driver.py#L1049

The code appears to be intended to log and ignore the exception, but
for some reason `ctxt.reraise = False` does not take effect:

`self.force_reraise()` is called in
https://github.com/openstack/oslo.utils/blob/stable/rocky/oslo_utils/excutils.py#L220
which should not be reached, since `reraise` is expected to be `False`.
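
To make the expectation concrete, here is a minimal sketch of the save-and-reraise pattern. It is a simplified stand-in written for this report, not the oslo.utils implementation and not the nova code: `SaveAndReraise`, `ExitCodeError`, `EncryptionNotSupported` and `detach` are hypothetical names, and the exit code 4 check is an assumption about how such a handler might be written. It shows one way the symptom could arise: if `reraise` is only set to `False` on a conditional branch, any failure that does not take that branch is still re-raised.

```python
import logging

LOG = logging.getLogger(__name__)


class SaveAndReraise:
    """Simplified stand-in for the pattern (assumption, not oslo.utils code)."""

    def __init__(self, exc):
        self.exc = exc
        self.reraise = True  # default: propagate the saved exception

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None:
            # A new exception was raised inside the block; let it propagate.
            return False
        if self.reraise:
            # Only reached while reraise is still True.
            raise self.exc
        return False


class ExitCodeError(Exception):
    """Hypothetical error carrying a command exit code."""

    def __init__(self, exit_code):
        super().__init__(f"command failed with exit code {exit_code}")
        self.exit_code = exit_code


class EncryptionNotSupported(Exception):
    """Hypothetical stand-in for an unrelated failure type."""


def detach(failure):
    """Simulate a detach step that fails with the given exception."""
    try:
        raise failure
    except ExitCodeError as e:
        with SaveAndReraise(e) as ctxt:
            if e.exit_code == 4:  # assumed "already gone" exit code
                LOG.debug("Ignoring exit code 4")
                ctxt.reraise = False
    return "detached"
```

In this sketch `detach(ExitCodeError(4))` returns normally, `detach(ExitCodeError(1))` re-raises because the flag was never cleared, and `detach(EncryptionNotSupported())` propagates without ever entering the handler. Whether either of the last two cases matches what happens in nova here is something we have not confirmed.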

As an experiment, we swallowed the exception by commenting out the
`excutils.save_and_reraise_exception()` section and replacing it with a
simple `pass`.

The instance then started, but it could not boot from the image. It was,
however, then possible to remove the encrypted volume attachment, reboot
the server and reattach the encrypted volume.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821696

Title:
  Failed to start instances with encrypted volumes

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821696/+subscriptions

