yahoo-eng-team team mailing list archive
Message #58141
[Bug 1636489] [NEW] Volume attachment fails for all the available instances after running different volume operations for 1-2 hours or more.
Public bug reported:
Steps to Reproduce:
1. Configure Devstack setup with storage backend (LVM).
2. Create at least 10-15 instances.
3. Run different volume operations via automation script for 5 hours or more.
4. Wait for 1-2 hours.
Observations:
1. Every attempt to attach a volume to an instance fails. The automation script logs show the record “attachVolume--break out wait after 5 minutes. ERROR >>>>>> Failed to attach volume”.
2. A DeviceIsBusy exception is raised (observed in n-cpu.log).
Additional Note:
This is observed only in the Devstack stable/Mitaka and stable/Newton releases. It works as expected with the Devstack stable/Liberty release. The volume operations executed randomly by the automation script are: create_volume, create_snapshot, delete_snapshot, delete_volume, attach_volume, detach_volume.
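For readers without access to the automation script, the random operation loop can be sketched roughly as follows. This is not the script used for this report: the `client` object and `run_random_operations` are hypothetical names, and the client is assumed to expose one no-argument method per operation and to track its own volumes, snapshots and instances.

```python
import random

# Operation names are taken from the report above.
OPERATIONS = [
    "create_volume", "create_snapshot", "delete_snapshot",
    "delete_volume", "attach_volume", "detach_volume",
]


def run_random_operations(client, count, seed=None):
    """Call `count` randomly chosen operations on a hypothetical client.

    Returns a list of (operation_name, error) tuples, where error is None
    on success and the repr() of the exception on failure.
    """
    rng = random.Random(seed)  # a fixed seed makes a failing run repeatable
    log = []
    for _ in range(count):
        name = rng.choice(OPERATIONS)
        try:
            getattr(client, name)()
            log.append((name, None))
        except Exception as exc:  # record the failure and keep going
            log.append((name, repr(exc)))
    return log
```

Recording the seed together with the log would make it possible to replay the same operation sequence against stable/Liberty and stable/Newton.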
Possible Suspect after analysis:
Before the failure, when the last detach request reaches an instance, Nova's "detach_volume" calls the detach method in libvirt. Libvirt reports success, but the guest XML still lists the device as attached. For the subsequent attach request, the hypervisor under libvirt tries to take an exclusive lock on the disk, with all I/O caching disabled. Libvirt treats this metadata as a black box and never attempts to interpret or modify it. Nova then finishes the teardown and releases the resources, which causes I/O errors in the guest and makes subsequent volume_attach requests from Nova fail because they try to use a resource that is still in use. This appears to be a race condition: it shows up first as an intermittent issue and then as a complete attachment failure once different volume operations are triggered continuously.
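To illustrate the suspected race, here is a minimal sketch of a check that waits until the disk has really disappeared from the guest XML before any teardown is attempted. It is not Nova's code and not a proposed patch. `disk_targets` and `wait_until_detached` are hypothetical names, and `get_xml` is assumed to be any callable returning the current domain XML as a string (with libvirt-python that could be `lambda: dom.XMLDesc(0)`).

```python
import time
import xml.etree.ElementTree as ET


def disk_targets(domain_xml):
    """Return the set of disk target device names (e.g. 'vdb') in a domain XML."""
    root = ET.fromstring(domain_xml)
    return {t.get("dev") for t in root.findall("./devices/disk/target")}


def wait_until_detached(get_xml, target_dev, timeout=30.0, interval=1.0,
                        sleep=time.sleep):
    """Poll the guest XML until target_dev is gone.

    Returns True once the device no longer appears in the XML, or False if
    it is still present when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if target_dev not in disk_targets(get_xml()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

If this returns False after libvirt has reported a successful detach, the guest still holds the device, which would match the behaviour described above.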
** Affects: nova
Importance: Undecided
Status: New
** Attachment added: "Nova compute logs and continuous hours operations logs"
https://bugs.launchpad.net/bugs/1636489/+attachment/4766956/+files/nova_cho_logs.zip
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1636489
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1636489/+subscriptions