← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1729371] Re: ResourceTracker races to delete instance allocations before instance is mapped to a cell

 

Reviewed:  https://review.openstack.org/517009
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d176175db456e4a312eed37ed4f5adcc06292fd5
Submitter: Zuul
Branch:    master

commit d176175db456e4a312eed37ed4f5adcc06292fd5
Author: Dan Smith <dansmith@xxxxxxxxxx>
Date:   Wed Nov 1 08:49:07 2017 -0700

    Avoid deleting allocations for instances being built
    
    The resource tracker's _remove_deleted_instances_allocations() assumes that
    InstanceNotFound means that an instance was deleted. That's not quite accurate,
    as we would also see that in the window between creating allocations and actually
    creating the instance in the cell database. So, the code now will kill
    allocations for those instances before they are created.
    
    This change makes us look up the instance with read_deleted=yes, and if we find
    it with deleted=True, then we do the allocation removal. This does mean that
    someone running a full DB archive at the instant an instance is deleted in some
    way that didn't result in allocation removal as well could leak those. However,
    we can log that (unlikely) situation.
    
    Closes-Bug: #1729371
    
    Change-Id: I4482ac2ecf8e07c197fd24c520b7f11fd5a10945


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1729371

Title:
  ResourceTracker races to delete instance allocations before instance
  is mapped to a cell

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Triaged

Bug description:
  We hit this in queens CI where we go to swap instance allocations to
  the migration uuid during a resize and the instance allocations on the
  source node are not found, which shouldn't happen:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-super-cond.txt.gz#_Oct_31_23_18_04_391235

  Oct 31 23:18:04.391235 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: ERROR nova.conductor.tasks.migrate [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] [instance: 87329b8a-0fa5-4467-b69e-6c43d4633f54] Unable to find existing allocations for instance
  Oct 31 23:18:04.421480 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: WARNING nova.scheduler.utils [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Failed to compute_task_migrate_server: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation: nova.exception.InstanceUnacceptable: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation

  Looking in Placement for that instance, it's allocations are created
  by the scheduler here:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

  Oct 31 23:18:00.637846 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Starting request: 104.239.175.193 "PUT /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
  Oct 31 23:18:00.684152 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Successfully wrote allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=23,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id='d44d813424704df8996b7d77840283c9'), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=24,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id='d44d813424704df8996b7d77840283c9')] {{(pid=15780) _set_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:249}}

  And shortly after that we see them deleted:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

  Oct 31 23:18:00.805083 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Starting request: 104.239.175.193 "DELETE /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
  Oct 31 23:18:00.814342 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Successfully deleted allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id=<?>), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id=<?>)] {{(pid=15780) delete_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:307}}

  Looking at the compute service logs, it looks like what is happening
  is the update_available_resource periodic task on the compute is
  running in between the time that the allocations are created on the
  compute node via the scheduler and before the instance is created in
  the cell database:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_165850

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_780729

  Oct 31 23:18:00.780729 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: DEBUG nova.compute.resource_tracker [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 has been deleted (perhaps locally). Deleting allocations that remained for this instance against this compute host: {'resources': {'VCPU': 1, 'MEMORY_MB': 64}}. {{(pid=23251) _remove_deleted_instances_allocations /opt/stack/new/nova/nova/compute/resource_tracker.py:1180}}
  Oct 31 23:18:00.817854 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: INFO nova.scheduler.client.report [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Deleted allocation for instance 87329b8a-0fa5-4467-b69e-6c43d4633f54

  That's because of this code that gets an InstanceNotFound and assumes
  the instance was deleted:

  https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L1169-L1181

  The compute service shouldn't have access to the nova_api database so
  we can't check to see if the instance is still represented as a build
  request, so we'll have to redo this deleted instance logic in the
  resource tracker to fix this race bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1729371/+subscriptions


References