
yahoo-eng-team team mailing list archive

[Bug 1358379] [NEW] drop_resize_claim() can't release the resource in some small window

 

Public bug reported:

Currently a resize resource claim is managed through the resize_claim()
and drop_resize_claim() pair. In theory, the claim should be released
once drop_resize_claim() is called. However, there is a small
window in which this release does not happen.

Currently the RT (resource tracker) tracks resource usage in two categories: the instances hosted on the node (_update_usage_from_instances()) and the migrations in or out of the node (_update_usage_from_migrations()).
An instance hosted on the node is sure to have a resource claim, and an in/out migration whose instance is not hosted on the node will also have a resource claim. If a resize happens to the same host, then one claim is tracked on the instance side and another on the migration side. This audit happens in the update_available_resource() periodic task.


The current drop_resize_claim() implementation always assumes the related resource is in the tracked migrations; however, this is not true if drop_resize_claim() happens before the audit periodic task runs. Consider audits at times t1 and (t1 + 60s), assuming the audit period is 60s. Between these two audits, an instance on this node is resized to another node and the user confirms the resize as well (i.e. this node is the source node).

Because the resize happened between two runs of the audit periodic task,
the RT is unaware of it and has no migration tracked. Thus when
drop_resize_claim(prefix='old_') is called, it has no resource claim to
release. The release will not happen until the next audit cycle, which
will find that the instance is no longer hosted on the node.
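
To make the window concrete, here is a minimal toy model of the sequence
described above. It is not Nova's code: ToyTracker, audit() and simulate()
are hypothetical names, the memory figures are made up, and the boolean
return value of drop_resize_claim() exists only to make the outcome visible
in this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Toy stand-in for the resource tracker on the source node (not Nova code)."""
    total_mb: int
    instances: dict = field(default_factory=dict)           # uuid -> MB hosted here
    tracked_migrations: dict = field(default_factory=dict)  # uuid -> MB in migration
    used_mb: int = 0

    def audit(self, hosted, migrations):
        """Periodic audit: rebuild usage from the instances and migrations passed in."""
        self.instances = dict(hosted)
        self.tracked_migrations = dict(migrations)
        self.used_mb = sum(hosted.values()) + sum(migrations.values())

    def drop_resize_claim(self, uuid):
        """Release usage only if the migration is already tracked."""
        mb = self.tracked_migrations.pop(uuid, None)
        if mb is None:
            return False  # nothing tracked, so nothing is released
        self.used_mb -= mb
        return True


def simulate():
    rt = ToyTracker(total_mb=8192)
    rt.audit(hosted={"vm1": 2048}, migrations={})  # audit at t1
    before = rt.used_mb
    # vm1 is resized to another node and confirmed before the next audit,
    # so the tracker has never seen the migration.
    released = rt.drop_resize_claim("vm1")
    stale = rt.used_mb
    rt.audit(hosted={}, migrations={})             # audit at t1 + 60s
    return before, released, stale, rt.used_mb
```

In this model the usage recorded at t1 stays charged to the node after
drop_resize_claim() returns, and is only corrected by the following audit.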

I'm not sure if this is really an issue. I think a) the result purely
depends on the length of the periodic task interval. If the interval is
very long, it will cause resource waste or, in the worst case, a
potential DoS issue. But it should be OK if the interval is short.
b) From an implementation point of view, drop_resize_claim(prefix='old_')
returning successfully without releasing the resource is bogus.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1358379

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1358379/+subscriptions

