[Bug 1527623] [NEW] Nova might orphan volumes when it's racing to delete a volume-backed instance

 

Public bug reported:

Discussed in the -dev mailing list here:

http://lists.openstack.org/pipermail/openstack-dev/2015-December/082596.html

When nova deletes a volume-backed instance, it detaches the volume first
here:

https://github.com/openstack/nova/blob/5508e11cf873384a28dc7416168d34e85f2c06cf/nova/compute/manager.py#L2293

And then deletes the volume here (if the delete_on_termination flag was
set to True):

https://github.com/openstack/nova/blob/5508e11cf873384a28dc7416168d34e85f2c06cf/nova/compute/manager.py#L2320
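
Roughly, the sequence at those two points looks like the sketch below. This
is a simplified illustration only; volume_api and the bdm objects are
stand-ins for nova's cinder API wrapper and block device mappings, not the
actual manager code.

def delete_volume_backed_instance(context, volume_api, bdms):
    # Step 1 (the first link above): detach each attached volume. Cinder
    # acknowledges with a 202 and finishes the detach asynchronously, so
    # the volume may still be 'detaching' after this returns.
    for bdm in bdms:
        if bdm.volume_id:
            volume_api.detach(context, bdm.volume_id)

    # Step 2 (the second link above): delete any volume flagged with
    # delete_on_termination. Nothing waits for the detach to complete
    # before this runs.
    for bdm in bdms:
        if bdm.volume_id and bdm.delete_on_termination:
            volume_api.delete(context, bdm.volume_id)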

The problem is that this code races: the detach is asynchronous, so nova
gets back a 202 and immediately goes on to delete the volume, which can
fail if the volume status is not yet 'available', as seen here:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20delete%20volume%5C%22%20AND%20message:%5C%22due%20to%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22

http://logs.openstack.org/36/231936/9/check/gate-tempest-dsvm-full-lio/31de861/logs/screen-n-cpu.txt.gz?level=TRACE#_2015-12-18_13_59_16_071

2015-12-18 13:59:16.071 WARNING nova.compute.manager [req-22431c70-78da-
4fea-b132-170d27177a6f tempest-TestVolumeBootPattern-196984582 tempest-
TestVolumeBootPattern-290257504] Failed to delete volume:
16f9252c-4036-463b-a053-60d4f46796c1 due to Invalid input received:
Invalid volume: Volume status must be available or error or
error_restoring or error_extending and  must not be migrating, attached,
belong to a consistency group or have snapshots. (HTTP 400) (Request-ID:
req-260c7d2a-d0aa-4ee1-b5a0-9b0c45f1d695)
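
To make the timing concrete, here is a small self-contained simulation of
the race; FakeCinder is purely hypothetical and only mimics the behavior
described above (asynchronous detach, delete rejected unless the volume is
'available'), it is not the real cinder client.

import threading
import time


class FakeCinder:
    """Stand-in for cinder: detach completes asynchronously and delete
    only accepts volumes whose status is 'available'."""

    def __init__(self):
        self.status = 'in-use'

    def detach(self, volume_id):
        def finish_detach():
            time.sleep(0.5)           # detach completes a bit later
            self.status = 'available'
        self.status = 'detaching'
        threading.Thread(target=finish_detach).start()
        return 202                    # acknowledged, not finished

    def delete(self, volume_id):
        if self.status != 'available':
            raise ValueError('Invalid volume: Volume status must be '
                             'available, currently %s' % self.status)
        self.status = 'deleting'


cinder = FakeCinder()
cinder.detach('16f9252c-4036-463b-a053-60d4f46796c1')
try:
    # nova calls delete right after the 202, while the volume is still
    # 'detaching', so the delete is rejected and the volume is orphaned.
    cinder.delete('16f9252c-4036-463b-a053-60d4f46796c1')
except ValueError as exc:
    print('delete failed: %s' % exc)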

This doesn't surface as an error in nova, because the compute manager's
_delete_instance method calls _cleanup_volumes with raise_exc=False, but
it orphans volumes in cinder, which then require manual cleanup on the
cinder side.
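
For context, the cleanup behavior described above amounts to something like
the following sketch (simplified; the exception handling and logging in the
real _cleanup_volumes are more involved):

import logging

LOG = logging.getLogger(__name__)


def cleanup_volumes(context, volume_api, bdms, raise_exc=True):
    exc_info = None
    for bdm in bdms:
        if bdm.volume_id and bdm.delete_on_termination:
            try:
                volume_api.delete(context, bdm.volume_id)
            except Exception as exc:
                exc_info = exc
                # With raise_exc=False this only produces the WARNING
                # shown in the log above; the instance delete proceeds
                # and the volume is left behind in cinder.
                LOG.warning('Failed to delete volume: %s due to %s',
                            bdm.volume_id, exc)
    if raise_exc and exc_info is not None:
        raise exc_info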

** Affects: nova
     Importance: Medium
         Status: Triaged


** Tags: compute kilo-backport-potential liberty-backport-potential volumes
