yahoo-eng-team team mailing list archive
Message #33041
[Bug 1457359] [NEW] race condition in quick detach/attach to the same volume and vm
Public bug reported:
Tested on Juno with Cells enabled.
The race condition happens as follows:
1. Send a detach request for a volume attached to an existing VM;
2. Immediately after #1, from another process, send a request to attach the same volume to the same VM.
Expected result:
a. #2 is refused because #1 is in progress, or
b. #2 finishes after #1 has finished.
However, the race may happen with the following sequence:
Req #1 finishes the physical detach >>
Req #1 finishes the Cinder call (setting the volume to available) >>
Req #2 enters the Nova API and gets through the call flow, since the volume is now available >>
Req #2 runs faster than Req #1 and updates the Nova DB BDMs with the volume info >>
Req #1 finishes last and removes the just-written volume info from the BDMs >>
Now the Cinder volume status and the Nova BDM state are mismatched. The volume becomes inoperable: both attach and detach operations will be refused.
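The interleaving above can be replayed as a minimal sketch (this is not Nova code; plain dicts stand in for the Cinder volume record and the Nova block_device_mapping table):

```python
# Illustrative replay of the race, step by step, in the order the
# bug describes. "vol-1" and "vm-1" are made-up identifiers.

cinder_volume = {"status": "in-use"}            # Cinder's view
nova_bdms = {"vol-1": {"instance": "vm-1"}}     # one existing BDM row

# Req #1 (detach): physical detach done, then the Cinder call.
cinder_volume["status"] = "available"           # Req #1 updates Cinder

# Req #2 (attach) enters the Nova API and passes the status check,
# because the volume is "available" at this instant.
assert cinder_volume["status"] == "available"
cinder_volume["status"] = "in-use"              # Req #2 re-attaches
nova_bdms["vol-1"] = {"instance": "vm-1"}       # Req #2 writes a BDM row

# Req #1's final step runs last and deletes the BDM row that
# Req #2 just wrote: the lost update.
del nova_bdms["vol-1"]

# End state: Cinder says "in-use" but Nova has no BDM row, so a new
# attach is refused (volume not available) and a detach is refused
# (no BDM found).
print(cinder_volume["status"], "vol-1" in nova_bdms)  # in-use False
```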
In our test case, the child-cell Nova DB and the parent-cell Nova DB also
went out of sync, because Req #2 overtook Req #1 while Req #1 was still
propagating its update from the child cell to the parent cell.
This issue is caused by the lack of a guard check against the Nova BDM
table in the attach process. The suggested fix is to add a volume-id check
against the Nova BDM table at the beginning of the request, guaranteeing
that no parallel modification can happen for a single volume/instance pair.
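A sketch of what such a guard could look like; the names here (guarded_attach, VolumeBusy) are hypothetical illustrations, not actual Nova APIs:

```python
# Hypothetical guard: refuse an attach while a BDM row for the same
# volume still exists, i.e. while a detach has not fully completed.
# nova_bdms is a plain dict standing in for the Nova BDM table.

class VolumeBusy(Exception):
    """Raised when the volume still has an in-flight BDM record."""


def guarded_attach(nova_bdms, volume_id, instance_id):
    # Guard check at the very start of the request: if the BDM table
    # still references this volume, a prior operation is in flight.
    if volume_id in nova_bdms:
        raise VolumeBusy(
            "volume %s still has a BDM record; retry later" % volume_id)
    nova_bdms[volume_id] = {"instance": instance_id}


bdms = {"vol-1": {"instance": "vm-1"}}   # detach of vol-1 not finished yet
try:
    guarded_attach(bdms, "vol-1", "vm-1")
except VolumeBusy:
    print("attach refused")              # expected outcome (a) above
```

In real Nova the check would of course be an atomic DB query rather than a dict lookup, so that the check and the insert cannot themselves race.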
The attachment is a slice of the logs showing the message disorder
triggered in the test case.
** Affects: nova
Importance: Undecided
Status: New
** Attachment added: "the message disorder triggered in the test case"
https://bugs.launchpad.net/bugs/1457359/+attachment/4401486/+files/attachdetach%20logs.txt
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1457359
Title:
race condition in quick detach/attach to the same volume and vm
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1457359/+subscriptions