yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #68971
[Bug 1729371] [NEW] ResourceTracker races to delete instance allocations before instance is mapped to a cell
Public bug reported:
We hit this in queens CI where we go to swap instance allocations to the
migration uuid during a resize and the instance allocations on the
source node are not found, which shouldn't happen:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-super-cond.txt.gz#_Oct_31_23_18_04_391235
Oct 31 23:18:04.391235 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: ERROR nova.conductor.tasks.migrate [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] [instance: 87329b8a-0fa5-4467-b69e-6c43d4633f54] Unable to find existing allocations for instance
Oct 31 23:18:04.421480 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: WARNING nova.scheduler.utils [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Failed to compute_task_migrate_server: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation: nova.exception.InstanceUnacceptable: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation
Looking in Placement for that instance, it's allocations are created by
the scheduler here:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083
Oct 31 23:18:00.637846 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Starting request: 104.239.175.193 "PUT /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.684152 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Successfully wrote allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=23,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id='d44d813424704df8996b7d77840283c9'), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=24,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id='d44d813424704df8996b7d77840283c9')] {{(pid=15780) _set_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:249}}
And shortly after that we see them deleted:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083
Oct 31 23:18:00.805083 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Starting request: 104.239.175.193 "DELETE /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.814342 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Successfully deleted allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id=<?>), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id=<?>)] {{(pid=15780) delete_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:307}}
Looking at the compute service logs, it looks like what is happening is
the update_available_resource periodic task on the compute is running in
between the time that the allocations are created on the compute node
via the scheduler and before the instance is created in the cell
database:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_165850
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_780729
Oct 31 23:18:00.780729 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: DEBUG nova.compute.resource_tracker [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 has been deleted (perhaps locally). Deleting allocations that remained for this instance against this compute host: {'resources': {'VCPU': 1, 'MEMORY_MB': 64}}. {{(pid=23251) _remove_deleted_instances_allocations /opt/stack/new/nova/nova/compute/resource_tracker.py:1180}}
Oct 31 23:18:00.817854 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: INFO nova.scheduler.client.report [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Deleted allocation for instance 87329b8a-0fa5-4467-b69e-6c43d4633f54
That's because of this code that gets an InstanceNotFound and assumes
the instance was deleted:
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L1169-L1181
The compute service shouldn't have access to the nova_api database so we
can't check to see if the instance is still represented as a build
request, so we'll have to redo this deleted instance logic in the
resource tracker to fix this race bug.
** Affects: nova
Importance: High
Assignee: Dan Smith (danms)
Status: Triaged
** Affects: nova/pike
Importance: High
Status: Triaged
** Tags: compute placement
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1729371
Title:
ResourceTracker races to delete instance allocations before instance
is mapped to a cell
Status in OpenStack Compute (nova):
Triaged
Status in OpenStack Compute (nova) pike series:
Triaged
Bug description:
We hit this in queens CI where we go to swap instance allocations to
the migration uuid during a resize and the instance allocations on the
source node are not found, which shouldn't happen:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-super-cond.txt.gz#_Oct_31_23_18_04_391235
Oct 31 23:18:04.391235 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: ERROR nova.conductor.tasks.migrate [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] [instance: 87329b8a-0fa5-4467-b69e-6c43d4633f54] Unable to find existing allocations for instance
Oct 31 23:18:04.421480 ubuntu-xenial-rax-iad-0000635032 nova-conductor[22376]: WARNING nova.scheduler.utils [None req-2bd7178b-307e-4342-a156-f9645b7f75a5 tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Failed to compute_task_migrate_server: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation: nova.exception.InstanceUnacceptable: Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 is unacceptable: Instance has no source node allocation
Looking in Placement for that instance, it's allocations are created
by the scheduler here:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083
Oct 31 23:18:00.637846 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Starting request: 104.239.175.193 "PUT /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.684152 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-d285a5a5-704b-4f46-98d5-f6e5ae7f6b4f service placement] Successfully wrote allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=23,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id='d44d813424704df8996b7d77840283c9'), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=24,project_id='d349ceecbff744a2943294be5bb7e427',resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id='d44d813424704df8996b7d77840283c9')] {{(pid=15780) _set_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:249}}
And shortly after that we see them deleted:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-placement-api.txt.gz#_Oct_31_23_18_00_805083
Oct 31 23:18:00.805083 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.requestlog [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Starting request: 104.239.175.193 "DELETE /placement/allocations/87329b8a-0fa5-4467-b69e-6c43d4633f54" {{(pid=15780) __call__ /opt/stack/new/nova/nova/api/openstack/placement/requestlog.py:38}}
Oct 31 23:18:00.814342 ubuntu-xenial-rax-iad-0000635032 devstack@placement-api.service[15777]: DEBUG nova.api.openstack.placement.handlers.allocation [None req-1fc99e13-67ca-4180-9a7f-d3a01a219b15 service placement] Successfully deleted allocations AllocationList[Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='MEMORY_MB',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=64,user_id=<?>), Allocation(consumer_id=87329b8a-0fa5-4467-b69e-6c43d4633f54,id=<?>,project_id=<?>,resource_class='VCPU',resource_provider=ResourceProvider(ec18595d-7007-47c9-bf13-38c8cf8a8bb0),used=1,user_id=<?>)] {{(pid=15780) delete_allocations /opt/stack/new/nova/nova/api/openstack/placement/handlers/allocation.py:307}}
Looking at the compute service logs, it looks like what is happening
is the update_available_resource periodic task on the compute is
running in between the time that the allocations are created on the
compute node via the scheduler and before the instance is created in
the cell database:
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_165850
http://logs.openstack.org/08/516708/3/gate/legacy-tempest-dsvm-
py35/4d8d6a3/logs/screen-n-cpu.txt.gz#_Oct_31_23_18_00_780729
Oct 31 23:18:00.780729 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: DEBUG nova.compute.resource_tracker [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Instance 87329b8a-0fa5-4467-b69e-6c43d4633f54 has been deleted (perhaps locally). Deleting allocations that remained for this instance against this compute host: {'resources': {'VCPU': 1, 'MEMORY_MB': 64}}. {{(pid=23251) _remove_deleted_instances_allocations /opt/stack/new/nova/nova/compute/resource_tracker.py:1180}}
Oct 31 23:18:00.817854 ubuntu-xenial-rax-iad-0000635032 nova-compute[23251]: INFO nova.scheduler.client.report [None req-9743bef5-318d-4e7e-a71b-26c62cd0af2d tempest-MigrationsAdminTest-834796780 tempest-MigrationsAdminTest-834796780] Deleted allocation for instance 87329b8a-0fa5-4467-b69e-6c43d4633f54
That's because of this code that gets an InstanceNotFound and assumes
the instance was deleted:
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L1169-L1181
The compute service shouldn't have access to the nova_api database so
we can't check to see if the instance is still represented as a build
request, so we'll have to redo this deleted instance logic in the
resource tracker to fix this race bug.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1729371/+subscriptions
Follow ups