yahoo-eng-team team mailing list archive
Message #88232
[Bug 1960401] [NEW] missing graceful recovery when attaching volume fails
Public bug reported:
Description
===========
When attaching a volume to an already running instance, the nova-api asks the nova-compute service to create a BlockDeviceMapping. If the nova-api does not receive a response within `rpc_response_timeout`, it treats the request as failed and raises an exception.
There are cases where nova-compute actually already processed the request and only the reply did not reach the nova-api in time. This can happen, for example, in the following cases (or combinations of them):
* nova-compute crashes or is unable to send the message reply back
* nova-api is handling too many other requests and does not get processing time to receive the message
* a configuration error in rabbitmq causes the message to be dropped before it can be read
* rabbitmq fails over to another node before the message is read (reply queues are non-persistent)
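The core race (work completed, reply lost) can be modeled in a few lines. This is a toy model with in-process queues standing in for rabbitmq, not nova code; all names here are invented for illustration:

```python
import queue
import threading

def compute_service(request_q: queue.Queue, db: list) -> None:
    """Stand-in for nova-compute: does the work, but the reply is lost."""
    instance, volume = request_q.get()
    db.append((instance, volume))  # the BlockDeviceMapping is created...
    # ...but no reply is ever sent (crash, rabbitmq failover, TTL policy).

def api_attach(request_q: queue.Queue, reply_q: queue.Queue,
               timeout: float = 0.1):
    """Stand-in for nova-api: gives up after rpc_response_timeout."""
    request_q.put(("i-1", "v-1"))
    try:
        return reply_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("messaging timeout")  # request looks failed...

db: list = []
request_q: queue.Queue = queue.Queue()
reply_q: queue.Queue = queue.Queue()
worker = threading.Thread(target=compute_service, args=(request_q, db))
worker.start()
try:
    api_attach(request_q, reply_q)
except TimeoutError:
    pass  # ...yet db now contains the orphaned entry
worker.join()
```

The caller sees a timeout, but the record it asked for was still created; nothing ever reconciles the two views.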
The state after the failed request is the same in all cases: the
database contains a BlockDeviceMapping entry for the volume +
instance combination that will never be cleaned up. This entry
also causes the nova-api to reject all future attachments of this volume
to this instance (as it assumes the volume is already attached).
Manual intervention (deleting the offending database entry) is then
required before the volume can be attached again.
It seems there was already a proposed fix for this, which has been abandoned (https://review.opendev.org/c/openstack/nova/+/731804).
I will propose a new fix based on the same idea.
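The idea can be sketched roughly as follows. This is a minimal, self-contained sketch with stand-in classes and invented names (`attachment_id`, `AlreadyAttached`), not nova's actual objects: if an attach request finds a leftover mapping whose attach never completed, it removes the stale entry and retries instead of refusing.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BlockDeviceMapping:
    # Stand-in for nova's BlockDeviceMapping; field names are illustrative.
    instance_uuid: str
    volume_id: str
    attachment_id: Optional[str] = None  # set once the attach has completed

class AlreadyAttached(Exception):
    pass

def attach_volume(db: List[BlockDeviceMapping], instance_uuid: str,
                  volume_id: str) -> BlockDeviceMapping:
    """Attach a volume, recovering from a mapping left by a timed-out attempt."""
    for bdm in list(db):
        if bdm.instance_uuid == instance_uuid and bdm.volume_id == volume_id:
            if bdm.attachment_id is not None:
                # A completed attach really exists: keep rejecting.
                raise AlreadyAttached(volume_id)
            # Leftover of a request whose reply was lost: clean up and retry.
            db.remove(bdm)
    bdm = BlockDeviceMapping(instance_uuid, volume_id, attachment_id="att-1")
    db.append(bdm)
    return bdm
```

With a check along these lines, the second attach in the scenario above could succeed instead of requiring a manual database cleanup.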
Steps to reproduce
==================
This issue is not reliably reproducible. The rough steps should be
(with non-production changes to make reproducing the issue more likely):
* create an instance and a volume of your choice
* generate unrelated high load on the nova-api
* configure a policy in rabbitmq to drop all messages in reply queues after 1 ms
* try to attach the volume to the instance (you should hopefully get a messaging timeout)
* try to attach the volume again; it will fail because the volume is considered already attached
Expected result
===============
The second volume attach call should make an additional attempt to attach the volume.
Actual result
=============
The second volume attach call fails because the nova-api assumes the volume is already attached.
Environment
===========
stable/queens
(Issue is also present on master)
** Affects: nova
Importance: Undecided
Assignee: Felix Huettner (felix.huettner)
Status: In Progress
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1960401
Title:
missing graceful recovery when attaching volume fails
Status in OpenStack Compute (nova):
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1960401/+subscriptions