
yahoo-eng-team team mailing list archive

[Bug 1859496] Re: Deleting stuck build instance may leak allocations

 

Reviewed:  https://review.opendev.org/702368
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f35930eef8fa27ee972e87366abb38596839fdba
Submitter: Zuul
Branch:    master

commit f35930eef8fa27ee972e87366abb38596839fdba
Author: Alexandre Arents <alexandre.arents@xxxxxxxxxxxx>
Date:   Mon Jan 13 15:53:24 2020 +0000

    Avoid allocation leak when deleting instance stuck in BUILD
    
    During an instance build, the conductor claims resources through the
    scheduler and creates the instance DB entry in the cell.
    
    If for any reason the conductor cannot complete a build after the
    instance claim (e.g. AMQP issues, or a conductor restart before the build
    completes) and in the meantime the user requests deletion of the instance
    stuck in BUILD, nova-api deletes the build_request but leaves the
    allocation in place, resulting in a leak.
    
    This change makes nova-api ensure that allocations are cleaned up
    when a build is ongoing or incomplete.
    Note that because the build did not reach a cell, compute is not able to
    heal the allocation during its periodic update_available_resource task.
    Furthermore, the change ensures that the instance mapping is also queued
    for deletion.
    
    Change-Id: I4d3193d8401614311010ed0e055fcb3aaeeebaed
    Closes-Bug: #1859496


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1859496

Title:
  Deleting stuck build instance may leak allocations

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========

  After control plane issues during instance creation,
  instances may stay stuck in the BUILD state.

  Even after they are deleted, their placement allocations may remain,
  and the compute host log complains:
  Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.

  
  Steps to reproduce
  ==================

  On a fresh devstack master install

  
  1) Open a terminal that displays the entries in placement.allocations and nova_cell1.instances every second:
  while true ; do  date ; mysql -e "select * from placement.allocations" ; mysql -e "select * from nova_cell1.instances where deleted=0" ;sleep 1 ; done

  2) Trigger a spawn of 50 instances and kill RabbitMQ after 5 seconds to simulate a control plane issue:
  openstack server create  --flavor m1.tiny --image cirros-0.4.0-x86_64-disk --nic net-id=private alex --min 50 --max 50 & sleep 5 ;  sudo pkill rabbitmq-server

  Note: To hit the bug, the goal is to get the instances allocated by
  the scheduler without giving the conductor time to create the entries
  in nova_cell1.instances.

  You should see rows appearing in placement.allocations:
  +---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
  | created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used |
  +---------------------+------------+------+----------------------+--------------------------------------+-------------------+------+
  | 2020-01-13 11:02:51 | NULL       | 1727 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 2 |    1 |
  | 2020-01-13 11:02:51 | NULL       | 1728 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 1 |  512 |
  | 2020-01-13 11:02:51 | NULL       | 1729 |                    1 | 8d0a42fe-922b-4c08-afe3-65d65893d355 |                 0 |    1 |
  | 2020-01-13 11:02:51 | NULL       | 1730 |                    1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda |                 2 |    1 |
  | 2020-01-13 11:02:51 | NULL       | 1731 |                    1 | 3cd1b8be-6997-452e-86e0-5013c9ab6bda |                 1 |  512 |
  .....

  The instances are all stuck in BUILD at this stage.

  3) Delete the instances:
  openstack server list | awk '/m1.tiny/ {print $2}' | xargs openstack server delete
  4) service rabbitmq-server start
  5) openstack server list
      <displays nothing>
  6)  mysql -e "select count(*) from placement.allocations"
  +----------+
  | count(*) |
  +----------+
  |      150 |
  +----------+
  The allocations remain.
  7) The nova-compute logs complain that:
  Instance eba20a0f-5856-4600-bcaa-7b758d04b5c5 has allocations against this compute host but is not found in the database.
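
  The count of 150 rows is consistent with 50 instances holding three
  allocation rows each (one per resource class, as in the table above).
  A minimal sketch for listing the consumers behind such rows follows. The
  find_orphans function and the file names are hypothetical and not part of
  nova; the inputs are assumed to be plain files with one UUID per line,
  for example dumped from the consumer_id column of placement.allocations
  and the uuid column of nova_cell1.instances (where deleted=0).

```shell
#!/bin/sh
# Hypothetical helper (not part of nova): print each consumer UUID that
# holds allocation rows but has no matching non-deleted instance row.
#   $1: file of consumer_id values, one per line (repeats are expected,
#       since there is one row per resource class)
#   $2: file of instance uuid values, one per line
find_orphans() {
    allocations_file=$1
    instances_file=$2
    # FILENAME is compared instead of the usual NR == FNR idiom so that an
    # empty instances file (the exact situation in this bug) works too.
    awk '
        FILENAME == ARGV[1] { known[$1] = 1; next }
        NF && !($1 in known) && !seen[$1]++ { print $1 }
    ' "$instances_file" "$allocations_file"
}

# Example usage (assumed file names):
#   mysql -N -e "select consumer_id from placement.allocations" > consumers.txt
#   mysql -N -e "select uuid from nova_cell1.instances where deleted=0" > instances.txt
#   find_orphans consumers.txt instances.txt
```

  A UUID printed by this sketch is only a candidate: builds still in flight
  and migrations can also hold allocations without a matching row in
  nova_cell1.instances. Once a consumer is confirmed as leaked, and assuming
  the osc-placement plugin is installed, its allocations can be removed with
  "openstack resource provider allocation delete <consumer_uuid>".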

  Expected result
  ===============
  The placement allocations of an instance must be cleaned up after deletion.

  Actual result
  =============
  The placement allocations of the instance are leaked.

  
  Environment
  ===========
  At least Stein through master seem to be impacted.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1859496/+subscriptions

