yahoo-eng-team mailing list archive: Message #82406
[Bug 1738297] Re: Nova Destroys Local Disks for Instance with Broken iSCSI Connection to Cinder Volume Upon Resume from Suspend
I'm happy that the main root cause (the deletion of the source disks) is fixed.
To be clear, you can configure Nova to resume guest state when the compute
service restarts, with this flag:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot
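For reference, enabling it in nova.conf should look something like this (the
option lives in the [DEFAULT] section and defaults to false):

    [DEFAULT]
    # Restore guests to the power state they had before the compute
    # host rebooted, i.e. resume instances that were running.
    resume_guests_state_on_host_boot = true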
Closing the bug.
** Changed in: nova
Status: New => Won't Fix
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1738297
Title:
Nova Destroys Local Disks for Instance with Broken iSCSI Connection to
Cinder Volume Upon Resume from Suspend
Status in OpenStack Compute (nova):
Won't Fix
Bug description:
Background: a Libvirt + KVM cloud running Newton (though the relevant code
appears the same on master). Earlier this week we had some issues with
a Cinder storage server (it uses LVM + iSCSI). The tgt service was consuming
100% CPU (after running for several months), and the compute nodes lost
their iSCSI connections. I had to restart tgt, the cinder-volume service,
and a number of compute hosts and instances.
Today, a user tried resuming their instance, which had been suspended
before the aforementioned trouble. (Note: this instance has root and
ephemeral disks stored locally, and a third disk on shared Cinder
storage.) It appears (per the logs linked below) that the iSCSI
connection from the compute host to the Cinder storage server was
broken or missing, and because of this, Nova apparently "cleaned up"
the instance, including *destroying its disk files*. The instance is
now in an error state.
nova-compute.log: http://paste.openstack.org/show/628991/
/var/log/syslog: http://paste.openstack.org/show/628992/
Based on the log messages ("Deleting instance files" and "Deletion of /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del complete"), it appears that we ended up in the `delete_instance_files` function: https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
No traceback was logged for this, but I'm guessing we got there from the `cleanup` function: https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
`cleanup` has a `destroy_disks=True` default argument, so I'm guessing it was called with its defaults, or the value was not overridden.
(Someone, please correct me if the available data suggest otherwise!)
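For context, here is a heavily simplified, paraphrased sketch of that path
(not the exact Newton code; follow the links above for the real logic):

    import logging

    LOG = logging.getLogger(__name__)

    class LibvirtDriverSketch(object):
        def cleanup(self, context, instance, network_info,
                    block_device_info=None, destroy_disks=True,
                    migrate_data=None, destroy_vifs=True):
            # The real cleanup() first tears down the domain, vifs, and
            # volume connections; the part relevant to this bug is that
            # it then deletes local disks whenever destroy_disks is
            # True, which is its default.
            if destroy_disks:
                self.delete_instance_files(instance)

        def delete_instance_files(self, instance):
            # The real method renames the instance directory to
            # <uuid>_del and removes it, producing the log lines
            # quoted above.
            LOG.info('Deleting instance files %s_del', instance.uuid)
            LOG.info('Deletion of %s_del complete', instance.uuid)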
Nobody requested a Delete action, so this appears to be Nova deciding
to destroy an instance's local disks after encountering an otherwise-
unhandled exception related to the iSCSI device being unavailable. I
will try to reproduce and update the bug if successful.
For us, losing an instance's data is a Problem -- our users
(scientists) often store unique data on instances that are configured
by hand. If an instance cannot be resumed, I would much rather Nova
leave the instance's disks intact for investigation / data recovery,
instead of throwing everything out. For deployments whose instances
may contain important data, could this behavior be made configurable?
Perhaps "destroy_disks_on_failed_resume = False" in nova.conf?
Thank you!
Chris Martin
(P.S. This is actually a Cinder question, but someone here may know: is
there anything that can or should be done to re-initialize iSCSI
connections between compute nodes and a Cinder storage server after the
iSCSI target service on the storage server recovers from a failure?)
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1738297/+subscriptions