yahoo-eng-team team mailing list archive
Message #81776
[Bug 1859496] Re: Deleting stuck build instance may leak allocations
Reviewed: https://review.opendev.org/702368
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f35930eef8fa27ee972e87366abb38596839fdba
Submitter: Zuul
Branch: master
commit f35930eef8fa27ee972e87366abb38596839fdba
Author: Alexandre Arents <alexandre.arents@xxxxxxxxxxxx>
Date: Mon Jan 13 15:53:24 2020 +0000
Avoid allocation leak when deleting instance stuck in BUILD
During instance build, the conductor claims resources through the scheduler
and creates the instance DB entry in the cell.
If for any reason the conductor is not able to complete the build after
the claim (e.g. AMQP issues, conductor restart before the build completes)
and in the meantime the user requests deletion of the instance stuck in BUILD,
nova-api deletes the build_request but leaves the allocation in place, resulting
in a leak.
This change makes nova-api ensure that allocations are cleaned up
in case of an ongoing/incomplete build.
Note that because the build did not reach a cell, compute is not able to heal
the allocation during its periodic update_available_resource task.
Furthermore, it ensures that the instance mapping is also queued for deletion.
Change-Id: I4d3193d8401614311010ed0e055fcb3aaeeebaed
Closes-Bug: #1859496
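The behaviour described in the commit message can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: ToyPlacement, delete_instance_in_build and simulate are hypothetical names invented for illustration and are not nova's real API, and the resource amounts are illustrative values for a small flavor.

```python
# Toy in-memory model of the leak; none of these names are nova's real API.


class ToyPlacement:
    """Maps a consumer UUID to the resources claimed for it."""

    def __init__(self):
        self.allocations = {}

    def claim(self, consumer_uuid, resources):
        self.allocations[consumer_uuid] = dict(resources)

    def delete_allocations(self, consumer_uuid):
        # Idempotent: deleting an unknown consumer is not an error.
        self.allocations.pop(consumer_uuid, None)


def delete_instance_in_build(uuid, build_requests, cell_instances,
                             placement, cleanup_allocations):
    """Handle a delete request for an instance that never reached a cell."""
    if uuid in cell_instances:
        # The normal delete path (instance exists in a cell) is not modeled.
        raise NotImplementedError("normal delete path is out of scope")
    # Old and new behaviour: the build request is removed.
    build_requests.discard(uuid)
    # New behaviour sketched from the commit message: also drop allocations,
    # because no compute host knows the instance and none will heal them.
    if cleanup_allocations:
        placement.delete_allocations(uuid)


def simulate(cleanup_allocations):
    """Claim resources, fail before the cell DB entry exists, then delete."""
    placement = ToyPlacement()
    build_requests = {"inst-1"}
    cell_instances = set()  # conductor never created the instance record
    placement.claim("inst-1", {"VCPU": 1, "MEMORY_MB": 512, "DISK_GB": 1})
    delete_instance_in_build("inst-1", build_requests, cell_instances,
                             placement, cleanup_allocations)
    return sorted(placement.allocations)


leaked_before_fix = simulate(cleanup_allocations=False)
leaked_after_fix = simulate(cleanup_allocations=True)
```

With cleanup disabled the consumer stays in the toy placement store after the delete, which mirrors the leak; with cleanup enabled the store ends up empty.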
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1859496
Title:
Deleting stuck build instance may leak allocations
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Description
===========
After control plane issues during instance creation,
instances may stay stuck in BUILD state.
Even after they are deleted, their placement allocations may remain,
and the compute host log complains that:
Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.
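To see which instances are affected, the UUIDs can be pulled out of the nova-compute log. The following is a minimal sketch that assumes the message text is exactly the one quoted above; the function name leaked_instance_uuids is hypothetical, and any log prefix (timestamp, level, request id) is simply ignored by the search.

```python
import re

# Canonical lowercase UUID, as printed in the log message quoted above.
_UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
_LEAK_RE = re.compile(
    r"Instance (" + _UUID + r") has allocations against this compute host "
    r"but is not found in the database"
)


def leaked_instance_uuids(log_lines):
    """Return the distinct instance UUIDs reported, in first-seen order."""
    seen = []
    for line in log_lines:
        match = _LEAK_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against "
    "this compute host but is not found in the database.",
    "some unrelated log line",
]
uuids = leaked_instance_uuids(sample)
```

The periodic task repeats the warning on every run, so the helper deduplicates while keeping the order in which UUIDs first appear.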
Steps to reproduce
==================
On a fresh devstack master install
1) Open a terminal that displays the entries in placement.allocations and nova_cell1.instances every second:
while true ; do date ; mysql -e "select * from placement.allocations" ; mysql -e "select * from nova_cell1.instances where deleted=0" ;sleep 1 ; done
2) Trigger a spawn of 50 instances and kill rabbit after 5 seconds to simulate an issue on the control plane:
openstack server create --flavor m1.tiny --image cirros-0.4.0-x86_64-disk --nic net-id=private alex --min 50 --max 50 & sleep 5 ; sudo pkill rabbitmq-server
Note: To hit the bug, the goal is to get instances allocated by the
scheduler, but not give the conductor enough time to create the entries in
nova_cell1.instances.
You should see allocations appearing in placement.allocations:
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
| created_at | updated_at | id | resource_provider_id | consumer_id | resource_class_id | used |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
| 2020-01-13 11:02:51 | NULL | 1727 | 1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 | 2 | 1 |
| 2020-01-13 11:02:51 | NULL | 1728 | 1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 | 1 | 512 |
| 2020-01-13 11:02:51 | NULL | 1729 | 1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 | 0 | 1 |
| 2020-01-13 11:02:51 | NULL | 1730 | 1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda | 2 | 1 |
| 2020-01-13 11:02:51 | NULL | 1731 | 1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda | 1 | 512 |
.....
Instances are all stuck in BUILD at this stage.
3) Delete the instances:
openstack server list | awk '/m1.tiny/ {print $2}' | xargs openstack server delete
4) service rabbitmq-server start
5) openstack server list
<displays nothing>
6) mysql -e "select count(*) from placement.allocations"
+----------+
| count(*) |
+----------+
| 150 |
+----------+
The allocations remain.
7) The nova-compute logs complain that:
Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.
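The check in steps 5 to 7 can be sketched as a comparison between the consumers in placement.allocations and the instances nova still knows about. This is an illustration only: find_leaked_consumers is a hypothetical helper, the rows are assumed to be (consumer_id, resource_class_id, used) tuples already fetched from the database, and it assumes every consumer is an instance (consumers such as in-progress migrations would need to be excluded first).

```python
from collections import Counter


def find_leaked_consumers(allocation_rows, known_instance_uuids):
    """Return {consumer_id: row_count} for consumers nova no longer knows.

    allocation_rows: iterable of (consumer_id, resource_class_id, used).
    known_instance_uuids: UUIDs of instances that still exist (not deleted).
    """
    known = set(known_instance_uuids)
    rows_per_consumer = Counter(
        consumer for consumer, _resource_class, _used in allocation_rows
    )
    return {
        consumer: count
        for consumer, count in rows_per_consumer.items()
        if consumer not in known
    }


# Rows modeled on the table in step 2: three resource classes per instance.
rows = [
    ("8d0a42fe-922b-4c08-afe3-65d65893d355", 2, 1),
    ("8d0a42fe-922b-4c08-afe3-65d65893d355", 1, 512),
    ("8d0a42fe-922b-4c08-afe3-65d65893d355", 0, 1),
]
# After step 3 no instance is left, so every consumer counts as leaked.
leaked = find_leaked_consumers(rows, known_instance_uuids=[])
```

Each instance holds one row per resource class, which is consistent with the count in step 6: 50 instances with 3 rows each give 150 rows.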
Expected result
===============
The placement allocations of an instance have to be cleaned up after its deletion.
Actual result
=============
The placement allocations of the instance are leaked.
Environment
===========
At least stein through master seem to be impacted.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1859496/+subscriptions