yahoo-eng-team mailing list archive: Message #82406
[Bug 1738297] Re: Nova Destroys Local Disks for Instance with Broken iSCSI Connection to Cinder Volume Upon Resume from Suspend
I'm happy that the main root cause (the deletion of the source disks) is fixed.
To be clear, you can configure Nova to resume guest state when the compute
service restarts, with this flag:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot
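For reference, enabling it in nova.conf should look something like this (the
option lives in the [DEFAULT] section and defaults to false):

    [DEFAULT]
    # Restore guests to the power state they had before the compute
    # host rebooted, i.e. resume instances that were running.
    resume_guests_state_on_host_boot = true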
Closing the bug.
** Changed in: nova
Status: New => Won't Fix
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1738297
Title:
Nova Destroys Local Disks for Instance with Broken iSCSI Connection to
Cinder Volume Upon Resume from Suspend
Status in OpenStack Compute (nova):
Won't Fix
Bug description:
Background: a Libvirt + KVM cloud running Newton (though the relevant code
appears the same on master). Earlier this week we had some issues with
a Cinder storage server (it uses LVM + iSCSI). The tgt service was consuming
100% CPU (after running for several months), and the compute nodes lost
their iSCSI connections. I had to restart tgt, the cinder-volume service,
and a number of compute hosts and instances.
Today, a user tried resuming their instance, which had been suspended
before the aforementioned trouble. (Note: this instance has root and
ephemeral disks stored locally, and a third disk on shared Cinder
storage.) It appears (per the logs linked below) that the iSCSI
connection from the compute host to the Cinder storage server was
broken or missing, and because of this, Nova apparently "cleaned up"
the instance, including *destroying its disk files*. The instance is
now in an error state.
nova-compute.log: http://paste.openstack.org/show/628991/
/var/log/syslog: http://paste.openstack.org/show/628992/
Based on the log messages ("Deleting instance files" and "Deletion of /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del complete"), it appears that we ended up in the `delete_instance_files` function: https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
No traceback was logged for this, but I'm guessing we got there from the `cleanup` function: https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
`cleanup` has a `destroy_disks=True` default argument, so I'm guessing it was called with its defaults, or the value was not overridden.
(Someone, please correct me if the available data suggest otherwise!)
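For context, here is a heavily simplified, paraphrased sketch of that path
(not the exact Newton code; follow the links above for the real logic):

    import logging

    LOG = logging.getLogger(__name__)

    class LibvirtDriverSketch(object):
        def cleanup(self, context, instance, network_info,
                    block_device_info=None, destroy_disks=True,
                    migrate_data=None, destroy_vifs=True):
            # The real cleanup() first tears down the domain, vifs, and
            # volume connections; the part relevant to this bug is that
            # it then deletes local disks whenever destroy_disks is
            # True, which is its default.
            if destroy_disks:
                self.delete_instance_files(instance)

        def delete_instance_files(self, instance):
            # The real method renames the instance directory to
            # <uuid>_del and removes it, producing the log lines
            # quoted above.
            LOG.info('Deleting instance files %s_del', instance.uuid)
            LOG.info('Deletion of %s_del complete', instance.uuid)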
Nobody requested a Delete action, so this appears to be Nova deciding
to destroy an instance's local disks after encountering an otherwise-
unhandled exception related to the iSCSI device being unavailable. I
will try to reproduce and update the bug if successful.
For us, losing an instance's data is a Problem -- our users
(scientists) often store unique data on instances that are configured
by hand. If an instance cannot be resumed, I would much rather Nova
leave the instance's disks intact for investigation / data recovery,
instead of throwing everything out. For deployments whose instances
may contain important data, could this behavior be made configurable?
Perhaps "destroy_disks_on_failed_resume = False" in nova.conf?
Thank you!
Chris Martin
(P.S. This is actually a Cinder question, but someone here may know: is
there anything that can or should be done to re-initialize iSCSI
connections between compute nodes and a Cinder storage server after the
iSCSI target service on the storage server recovers from a failure?)
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1738297/+subscriptions