yahoo-eng-team team mailing list archive

[Bug 1738297] [NEW] Nova Destroys Local Disks for Instance with Broken iSCSI Connection to Cinder Volume Upon Resume from Suspend?

 

Public bug reported:

Background: Libvirt + KVM cloud running Newton (though the relevant code
appears the same on master). Earlier this week we had some issues with a
Cinder storage server (it uses LVM+iSCSI). The tgt service was consuming
100% CPU (after running for several months), and compute nodes lost their
iSCSI connections. I had to restart tgt, the cinder-volume service, and a
number of compute hosts and instances.

Today, a user tried resuming their instance, which had been suspended before
the aforementioned trouble. (Note: this instance has its root and ephemeral
disks stored locally and a third disk on shared Cinder storage.) It appears
(per the logs linked below) that the iSCSI connection from the compute host
to the Cinder storage server was broken/missing, and because of this, Nova
apparently "cleaned up" the instance, including *destroying its disk
files*. The instance is now in an error state.

nova-compute.log: http://paste.openstack.org/show/628991/
/var/log/syslog: http://paste.openstack.org/show/628992/

As mentioned, we're still running Newton, but the code appears the same on master.
Based on the log messages ("Deleting instance files" and "Deletion of /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del complete"), it appears that we ended up in this function, `delete_instance_files`: https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
A traceback wasn't logged for this, but I'm guessing we got here from the `cleanup` function: https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
One of `cleanup`'s arguments is `destroy_disks=True`, so I'm guessing it was called with that default, or at least that the argument wasn't overridden.
(Someone, please correct me if the available data suggest otherwise!)
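
To make the suspected behavior concrete, here is a toy paraphrase of the pattern I believe the linked code follows. This is NOT the actual Nova source, just my reading of it, and the instances path is an assumption based on our stock configuration:

```python
# Toy model only (not Nova's code): if the resume error path falls through to
# cleanup() with the default destroy_disks=True, the instance's local files
# are removed even though nobody requested a delete.
import shutil
from pathlib import Path

# Assumption: the default instances_path used by our deployment.
INSTANCES_PATH = Path("/var/lib/nova/instances")


def delete_instance_files(instance_uuid: str) -> None:
    """Mimic the rename-to-<uuid>_del-then-remove sequence seen in our logs."""
    target = INSTANCES_PATH / instance_uuid
    staging = INSTANCES_PATH / (instance_uuid + "_del")
    if target.exists():
        target.rename(staging)
    if staging.exists():
        shutil.rmtree(staging)  # matches "Deletion of ..._del complete"


def cleanup(instance_uuid: str, destroy_disks: bool = True) -> None:
    """Simplified stand-in for the driver's cleanup(); note the default."""
    # Domain teardown and volume disconnects would happen here; in our case
    # the broken iSCSI session is what sent the resume down this error path.
    if destroy_disks:
        delete_instance_files(instance_uuid)
```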

Nobody requested a Delete action, so this appears to be Nova deciding to
destroy an instance's local disks after encountering an otherwise-
unhandled exception related to the iSCSI device being unavailable. I
will try to reproduce and update the bug if successful.

For us, losing an instance's data is a Big Problem -- our users
(scientists) often treat their instances as pets that are configured by
hand. If an instance cannot be resumed, I would much rather Nova leave
the instance's disks intact for investigation and data recovery than
throw everything out. For deployments whose instances may contain
important data, could this behavior be made configurable? Perhaps
"destroy_disks_on_failed_resume = False" in nova.conf?

Thank you!

Chris Martin

(P.S. This is really a Cinder question, but someone here may know: is there
something that can/should be done to re-initialize the iSCSI connections
between compute nodes and a Cinder storage server after the iSCSI target
service on the storage server has failed and been recovered? See the sketch
below for the sort of thing I have in mind.)
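
Purely illustrative sketch of what I mean by "re-initialize" -- rescanning existing sessions and re-logging-in known targets with iscsiadm. I don't know whether this is sufficient, or safe while instances still reference the affected volumes, or whether there is a proper Nova/Cinder-side mechanism instead:

```python
# Hypothetical operator-side sketch; not a recommendation, just the kind of
# manual recovery I am asking about.
import subprocess


def reinitialize_iscsi_sessions() -> None:
    # Rescan all current iSCSI sessions so LUNs reappear after the target recovers.
    subprocess.run(['iscsiadm', '-m', 'session', '--rescan'], check=False)
    # Attempt to log in to every known node record; targets that already have
    # a session will simply be reported as already logged in.
    subprocess.run(['iscsiadm', '-m', 'node', '--loginall=all'], check=False)
```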

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1738297

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1738297/+subscriptions

