
yahoo-eng-team team mailing list archive

[Bug 1636489] Re: Volume attachment fails for all the available instances after running different volume operations for 1-2 hours or more.

 

Since the Mitaka cycle we use the direct release model, which means
these bug reports should be marked Fix Released.

** Changed in: nova
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1636489

Title:
  Volume attachment fails for all the available instances after running
  different volume operations for 1-2 hours or more.

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Steps to Reproduce:
  1. Configure Devstack setup with storage backend (LVM).
  2. Create at least 10-15 instances.
  3. Run different volume operations via automation script for 5 hours or more.
  4. Wait for 1-2 hours.
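
  The automation script itself is not included in the report; a minimal sketch of such a driver loop might look like the following. The FakeCloud class and its methods are hypothetical stand-ins for the real cinderclient/novaclient calls, tracking only enough state to exercise the same six operations listed in the description with basic ordering constraints:

```python
import random

# Hypothetical stand-ins for the cinderclient/novaclient calls an
# automation script would make; they only track state so the loop
# logic can be exercised without a live cloud.
class FakeCloud:
    def __init__(self):
        self.volumes = set()
        self.snapshots = {}   # snapshot name -> source volume
        self.attached = {}    # volume name -> instance name

    def create_volume(self, name):
        self.volumes.add(name)

    def delete_volume(self, name):
        assert name not in self.attached, "cannot delete an attached volume"
        self.volumes.discard(name)

    def create_snapshot(self, vol, snap):
        self.snapshots[snap] = vol

    def delete_snapshot(self, snap):
        self.snapshots.pop(snap, None)

    def attach_volume(self, vol, instance):
        self.attached[vol] = instance

    def detach_volume(self, vol):
        self.attached.pop(vol, None)

def run_random_ops(cloud, instances, iterations, seed=0):
    """Randomly exercise the six operations from the bug description,
    respecting basic ordering (e.g. only detach what is attached)."""
    rng = random.Random(seed)
    counter = 0
    for _ in range(iterations):
        op = rng.choice(["create_volume", "create_snapshot",
                         "delete_snapshot", "delete_volume",
                         "attach_volume", "detach_volume"])
        detached = [v for v in cloud.volumes if v not in cloud.attached]
        if op == "create_volume":
            counter += 1
            cloud.create_volume("vol%d" % counter)
        elif op == "create_snapshot" and cloud.volumes:
            vol = rng.choice(sorted(cloud.volumes))
            cloud.create_snapshot(vol, "snap-" + vol)
        elif op == "delete_snapshot" and cloud.snapshots:
            cloud.delete_snapshot(rng.choice(sorted(cloud.snapshots)))
        elif op == "delete_volume" and detached:
            cloud.delete_volume(rng.choice(detached))
        elif op == "attach_volume" and detached:
            cloud.attach_volume(rng.choice(detached), rng.choice(instances))
        elif op == "detach_volume" and cloud.attached:
            cloud.detach_volume(rng.choice(sorted(cloud.attached)))
    return cloud
```

  Against a real deployment, each FakeCloud method would be replaced by the corresponding API call plus a wait for the expected volume status, which is where the five-minute timeout in the observed log message would come from.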

  Observations:
  1. Attachment fails for every volume attached to an instance. The record "attachVolume--break out wait after 5 minutes. ERROR >>>>>> Failed to attach volume" is displayed in the automation script logs.
  2. A DeviceIsBusy exception is raised (observed in n-cpu.log).

  Additional Note:
  It is observed only with the Devstack stable/Mitaka and stable/Newton releases; it works correctly with the Devstack stable/Liberty release. The volume operations executed randomly via the automation script are: create_volume, create_snapshot, delete_snapshot, delete_volume, attach_volume, detach_volume.

  Possible Suspect after analysis:
  Before the failure, when the last detach request for an instance arrives, Nova's "detach_volume" invokes the detach operation in libvirt, which reports success, but the device remains attached according to the guest XML. For the subsequent attach request, the hypervisor attempts to take an exclusive lock on the disk with all I/O caching disabled. Meanwhile Nova finishes the teardown and releases the resources, which causes I/O errors in the guest and makes subsequent volume_attach requests from Nova fail because they try to use an in-use resource. This appears to be a race condition: it produces intermittent failures and, after volume operations are triggered continuously, a complete attachment failure.
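
  One way to check for the suspected condition (libvirt reporting a successful detach while the device is still present in the guest definition) is to inspect the domain XML after the detach call. The helper below is an illustrative sketch: it parses an XML string such as the one returned by libvirt's dom.XMLDesc(0) or "virsh dumpxml", and the sample document and device names are made up for the example:

```python
import xml.etree.ElementTree as ET

def device_in_domain_xml(domain_xml, target_dev):
    """Return True if a <disk> whose <target dev=...> matches
    target_dev is still present in the guest definition."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None and target.get("dev") == target_dev:
            return True
    return False

# Illustrative fragment of a domain definition in which vdb is still
# attached even though the detach call has already returned success.
SAMPLE_XML = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <devices>
    <disk type='block' device='disk'>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='block' device='disk'>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""
```

  A caller that polled this check until the device disappeared, before releasing the backing resources, would avoid handing an in-use disk to the next attach request; retry logic along these general lines is the kind of fix this analysis points toward.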

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1636489/+subscriptions

