yahoo-eng-team team mailing list archive

[Bug 1763183] [NEW] DELETE /os-services/{service_id} does not block for hosted instances

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Wed, 11 Apr 2018 21:18:39 -0000
Reply-to: Bug 1763183 <1763183@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

This came up while reviewing the fix for bug 1756179:

https://review.openstack.org/#/c/554920/6/nova/api/openstack/compute/services.py@226

Full IRC conversation is here:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-04-11.log.html#t2018-04-11T20:32:13

The summary is that it's possible to delete a compute service and its
associated compute node record even if that compute node has instances
on it.

Before placement, this wasn't a huge problem because you could evacuate
the instances to another host, or if you brought the host back up, it
would recreate the service and compute node and the resource tracker
would "heal" itself by finding instances running on that host and node
combo:

https://github.com/openstack/nova/blob/2c5da2212c3fa3e589c4af171486a2097fd8c54e/nova/compute/resource_tracker.py#L714

The problem started in Pike, when we began requiring placement and
creating allocations in the scheduler. Those allocations are against the
compute_nodes.uuid for the compute node resource provider. If the
service and its related compute node record are deleted, restarting the
service will create a new service and compute node record with a new
UUID, which will result in a new resource provider in placement, and the
instances running on that host will still have allocations against the
now orphaned resource provider. The new resource provider will report
incorrect consumption, so scheduling will also be affected.
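
To make the failure mode concrete, here is a toy model of the sequence
described above. It is only a sketch: FakePlacement and
create_compute_node are hypothetical stand-ins written for this
illustration, not nova or placement code, and resource usage is reduced
to a single VCPU count.

```python
import uuid
from collections import defaultdict


class FakePlacement:
    """Hypothetical stand-in for placement: allocations per provider."""

    def __init__(self):
        # provider UUID -> {instance UUID -> VCPUs allocated}
        self.allocations = defaultdict(dict)

    def allocate(self, provider_uuid, instance_uuid, vcpus):
        self.allocations[provider_uuid][instance_uuid] = vcpus

    def usage(self, provider_uuid):
        return sum(self.allocations[provider_uuid].values())


def create_compute_node():
    """Assumption for this sketch: each new node record gets a new UUID."""
    return str(uuid.uuid4())


placement = FakePlacement()

# The scheduler allocates against the original compute node's provider.
old_node = create_compute_node()
placement.allocate(old_node, "instance-1", vcpus=2)
placement.allocate(old_node, "instance-2", vcpus=4)

# The service and node record are deleted, then the service restarts:
# a new node record (and so a new resource provider) is created.
new_node = create_compute_node()

# The allocations still point at the orphaned provider, so the new
# provider looks empty even though both instances run on the host.
print("orphaned provider usage:", placement.usage(old_node))
print("new provider usage:", placement.usage(new_node))
```

In this model the scheduler would see the new provider as having all of
its capacity free, which is the incorrect consumption noted above.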

So we should block deleting a compute service (and its node) if that
host (node) has instances on it. The delete handler is here:

https://github.com/openstack/nova/blob/2c5da2212c3fa3e589c4af171486a2097fd8c54e/nova/api/openstack/compute/services.py#L213
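
As a rough sketch of the kind of guard proposed here (not the actual
nova change), the delete path could refuse to remove a compute service
while its host still has instances. Everything below is hypothetical:
the ServiceInUse exception, the delete_compute_service helper and the
plain dicts standing in for the database are assumptions made for
illustration, and mapping the error to an HTTP 409 response is one
possible choice rather than something decided in this report.

```python
class ServiceInUse(Exception):
    """Hypothetical error; an API layer could map it to HTTP 409."""


def delete_compute_service(services, instances_by_host, service_id):
    """Delete a service record unless its host still has instances.

    services: dict of service id -> {"host": ..., "binary": ...}
    instances_by_host: dict of host name -> list of instance UUIDs
    """
    service = services[service_id]
    hosted = instances_by_host.get(service["host"], [])
    if service["binary"] == "nova-compute" and hosted:
        raise ServiceInUse(
            "service %s on host %s still has %d instance(s)"
            % (service_id, service["host"], len(hosted)))
    del services[service_id]


services = {
    1: {"host": "compute-1", "binary": "nova-compute"},
    2: {"host": "compute-2", "binary": "nova-compute"},
}
instances_by_host = {"compute-1": ["instance-1", "instance-2"]}

# compute-2 hosts nothing, so its service can be deleted.
delete_compute_service(services, instances_by_host, 2)

# compute-1 still hosts instances, so the delete is refused.
try:
    delete_compute_service(services, instances_by_host, 1)
    blocked = False
except ServiceInUse as exc:
    blocked = True
    print("delete blocked:", exc)
```

With a check like this, the operator has to migrate or delete the
instances before the service and its compute node record can go away,
so the allocations never end up pointing at an orphaned provider.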

This problem goes back to Pike. Ocata is OK in that the resource tracker
on Ocata computes will "heal" allocations during the
update_available_resource periodic task (and when the compute service
starts up), and in Ocata the FilterScheduler does not create allocations
in Placement.

** Affects: nova
     Importance: High
     Assignee: Matt Riedemann (mriedem)
         Status: Triaged

** Affects: nova/pike
     Importance: Undecided
         Status: New

** Affects: nova/queens
     Importance: Undecided
         Status: New


** Tags: api placement

** Also affects: nova/pike
   Importance: Undecided
       Status: New

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Changed in: nova
     Assignee: (unassigned) => Matt Riedemann (mriedem)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1763183

Title:
  DELETE /os-services/{service_id} does not block for hosted instances

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) pike series:
  New
Status in OpenStack Compute (nova) queens series:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1763183/+subscriptions
Follow ups

[Bug 1763183] Related fix merged to nova (master)
From: OpenStack Infra, 2018-04-20