← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1729371] [NEW] ResourceTracker races to delete instance allocations before instance is mapped to a cell

 

Public bug reported:

We hit this in queens CI where we go to swap instance allocations to the
migration uuid during a resize and the instance allocations on the
source node are not found, which shouldn't happen:

http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-super-cond.txt.gz#_Oct_31_23_18_04_391235

Oct 31 23:18:04.391235 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: ERROR nova.conductor.tasks.migrate [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] [instance: 87329b8a-0fa5-4467-b69e-6c43d4633f54] Unable to find existing allocations for instance
Oct 31 23:18:04.421480 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: WARNING nova.scheduler.utils [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Failed to compute_task_migrate_server: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation: nova.exception.InstanceUnacceptable: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation

Looking in Placement for that instance, it's allocations are created by
the scheduler here:

http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

Oct 31 23:18:00.637846 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Starting request: 104.239.175.193 "PUT /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.684152 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Successfully wrote allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=23,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id='d44d813424704df8996b7d77840283c9'), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=24,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id='d44d813424704df8996b7d77840283c9')] {{(pid=15780) _set_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:249}}

And shortly after that we see them deleted:

http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

Oct 31 23:18:00.805083 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Starting request: 104.239.175.193 "DELETE /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.814342 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Successfully deleted allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id=<?>), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id=<?>)] {{(pid=15780) delete_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:307}}

Looking at the compute service logs, it looks like what is happening is
the update_available_resource periodic task on the compute is running in
between the time that the allocations are created on the compute node
via the scheduler and before the instance is created in the cell
database:

http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_165850

http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_780729

Oct 31 23:18:00.780729 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: DEBUG nova.compute.resource_tracker [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 has been deleted (perhaps locally). Deleting allocations that remained for this instance against this compute host: {'resources': {'VCPU': 1, 'MEMORY_MB': 64}}. {{(pid=23251) _remove_deleted_instances_allocations /opt/stack/new/nova/nova/compute/resource_tracker.py:1180}}
Oct 31 23:18:00.817854 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: INFO nova.scheduler.client.report [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Deleted allocation for instance 87329b8a-0fa5-4467-b69e-6c43d4633f54

That's because of this code that gets an InstanceNotFound and assumes
the instance was deleted:

https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L1169-L1181

The compute service shouldn't have access to the nova_api database so we
can't check to see if the instance is still represented as a build
request, so we'll have to redo this deleted instance logic in the
resource tracker to fix this race bug.

** Affects: nova
     Importance: High
     Assignee: Dan Smith (danms)
         Status: Triaged

** Affects: nova/pike
     Importance: High
         Status: Triaged


** Tags: compute placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1729371

Title:
  ResourceTracker races to delete instance allocations before instance
  is mapped to a cell

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) pike series:
  Triaged

Bug description:
  We hit this in queens CI where we go to swap instance allocations to
  the migration uuid during a resize and the instance allocations on the
  source node are not found, which shouldn't happen:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-super-cond.txt.gz#_Oct_31_23_18_04_391235

  Oct 31 23:18:04.391235 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: ERROR nova.conductor.tasks.migrate [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] [instance: 87329b8a-0fa5-4467-b69e-6c43d4633f54] Unable to find existing allocations for instance
  Oct 31 23:18:04.421480 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: WARNING nova.scheduler.utils [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Failed to compute_task_migrate_server: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation: nova.exception.InstanceUnacceptable: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation

  Looking in Placement for that instance, it's allocations are created
  by the scheduler here:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

  Oct 31 23:18:00.637846 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Starting request: 104.239.175.193 "PUT /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
  Oct 31 23:18:00.684152 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Successfully wrote allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=23,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id='d44d813424704df8996b7d77840283c9'), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=24,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id='d44d813424704df8996b7d77840283c9')] {{(pid=15780) _set_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:249}}

  And shortly after that we see them deleted:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083

  Oct 31 23:18:00.805083 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Starting request: 104.239.175.193 "DELETE /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
  Oct 31 23:18:00.814342 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Successfully deleted allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id=<?>), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id=<?>)] {{(pid=15780) delete_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:307}}

  Looking at the compute service logs, it looks like what is happening
  is the update_available_resource periodic task on the compute is
  running in between the time that the allocations are created on the
  compute node via the scheduler and before the instance is created in
  the cell database:

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_165850

  http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
  py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_780729

  Oct 31 23:18:00.780729 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: DEBUG nova.compute.resource_tracker [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 has been deleted (perhaps locally). Deleting allocations that remained for this instance against this compute host: {'resources': {'VCPU': 1, 'MEMORY_MB': 64}}. {{(pid=23251) _remove_deleted_instances_allocations /opt/stack/new/nova/nova/compute/resource_tracker.py:1180}}
  Oct 31 23:18:00.817854 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: INFO nova.scheduler.client.report [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Deleted allocation for instance 87329b8a-0fa5-4467-b69e-6c43d4633f54

  That's because of this code that gets an InstanceNotFound and assumes
  the instance was deleted:

  https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L1169-L1181

  The compute service shouldn't have access to the nova_api database so
  we can't check to see if the instance is still represented as a build
  request, so we'll have to redo this deleted instance logic in the
  resource tracker to fix this race bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1729371/+subscriptions


Follow ups