yahoo-eng-team mailing list archive - Message #69925
[Bug 1738297] [NEW] Nova Destroys Local Disks for Instance with Broken iSCSI Connection to Cinder Volume Upon Resume from Suspend?
Public bug reported:
Background: Libvirt + KVM cloud running Newton (but the relevant code
appears the same on master). Earlier this week we had some issues with a
Cinder storage server (it uses LVM+iSCSI). The tgt service was consuming
100% CPU (after running for several months) and compute nodes lost their
iSCSI connections. I had to restart tgt, the cinder-volume service, and a
number of compute hosts and instances.
Today, a user tried resuming their instance, which had been suspended
before the aforementioned trouble. (Note: this instance has its root and
ephemeral disks stored locally and a third disk on shared Cinder storage.)
It appears (per the logs linked below) that the iSCSI connection from the
compute host to the Cinder storage server was broken/missing, and because
of this, Nova apparently "cleaned up" the instance, including *destroying
its disk files*. The instance is now in an error state.
nova-compute.log: http://paste.openstack.org/show/628991/
/var/log/syslog: http://paste.openstack.org/show/628992/
We're still running Newton, but the code appears the same on master.
Based on the log messages ("Deleting instance files" and "Deletion of /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del complete"), it appears that we ended up in `delete_instance_files`: https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
No traceback was logged for this, but I'm guessing we got there from the `cleanup` function: https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
`cleanup` defaults to `destroy_disks=True`, so I'm guessing it was called with the defaults or without overriding that argument.
(Someone, please correct me if the available data suggest otherwise!)
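To make the suspected path concrete, here is a minimal standalone sketch (not the actual nova code) of the pattern: a `cleanup()` whose `destroy_disks` argument defaults to True, so any caller that does not explicitly pass `destroy_disks=False` ends up removing the local instance directory. The paths and names below are illustrative assumptions only.

```python
# Minimal standalone sketch of the pattern described above -- NOT the actual
# nova code. A cleanup() that defaults to destroy_disks=True wipes the local
# instance directory unless the caller explicitly opts out.
# The instances path and function names are illustrative assumptions.

import shutil
from pathlib import Path

INSTANCES_PATH = Path("/var/lib/nova/instances")  # default instances_path on our hosts


def delete_instance_files(instance_uuid: str) -> None:
    """Rename <uuid> to <uuid>_del and remove it, which is what the
    'Deleting instance files' / 'Deletion of ..._del complete' lines in
    our nova-compute.log appear to correspond to."""
    instance_dir = INSTANCES_PATH / instance_uuid
    renamed = INSTANCES_PATH / (instance_uuid + "_del")
    if instance_dir.exists():
        instance_dir.rename(renamed)
        shutil.rmtree(renamed, ignore_errors=True)


def cleanup(instance_uuid: str, destroy_disks: bool = True) -> None:
    # ... detach volumes, unplug VIFs, undefine the domain (omitted) ...
    if destroy_disks:
        delete_instance_files(instance_uuid)


# Called with defaults, e.g. cleanup("68058b22-e17f-42f7-80ff-aeb06cbc82cb"),
# this removes the instance's local root and ephemeral disks.
```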
Nobody requested a Delete action, so this appears to be Nova deciding to
destroy an instance's local disks after encountering an otherwise
unhandled exception related to the iSCSI device being unavailable. I will
try to reproduce the problem and will update the bug if I succeed.
For us, losing an instance's data is a Big Problem -- our users
(scientists) often treat their instances as pets that are configured by
hand. If an instance cannot be resumed, I would much rather have Nova
leave the instance's disks intact for investigation / data recovery
instead of throwing everything out. For deployments whose instances may
contain important data, could this behavior be made configurable? Perhaps
something like "destroy_disks_on_failed_resume = False" in nova.conf
(sketched below)?
Thank you!
Chris Martin
(P.S. This is really a Cinder question, but someone here may know: is
there something that can or should be done to re-initialize iSCSI
connections between compute nodes and a Cinder storage server after the
iSCSI target service on the storage server has failed and then recovered?)
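For concreteness, the kind of thing I mean is a generic open-iscsi rescan / re-login on the compute host, roughly like the sketch below. I have no idea whether this is correct or sufficient, and it is not taken from any Nova or Cinder documentation.

```python
# Rough sketch only: a generic open-iscsi rescan / re-login on a compute
# host, wrapped in Python for illustration. This is NOT an official Nova or
# Cinder recovery procedure; whether it is safe or sufficient is exactly
# the question being asked above. Requires open-iscsi's iscsiadm.

import subprocess


def rescan_existing_sessions() -> None:
    # Show current sessions, then rescan them so LUNs that reappeared after
    # the tgt restart are rediscovered.
    subprocess.run(["iscsiadm", "-m", "session"], check=False)
    subprocess.run(["iscsiadm", "-m", "session", "--rescan"], check=False)


def relogin_dropped_targets() -> None:
    # Re-login to targets configured for automatic startup whose sessions
    # were torn down entirely while tgt was broken.
    subprocess.run(["iscsiadm", "-m", "node", "--loginall=automatic"], check=False)


if __name__ == "__main__":
    rescan_existing_sessions()
    relogin_dropped_targets()
```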
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1738297
Title:
Nova Destroys Local Disks for Instance with Broken iSCSI Connection to
Cinder Volume Upon Resume from Suspend?
Status in OpenStack Compute (nova):
New