yahoo-eng-team team mailing list archive
Message #77376
[Bug 1818914] [NEW] Hypervisor resource usage on source still shows old flavor usage after resize confirm until update_available_resource periodic runs
Public bug reported:
I actually uncovered this due to some failing functional tests for
cross-cell resize:
https://review.openstack.org/#/c/641176/2/nova/compute/resource_tracker.py@503
But this goes back to https://review.openstack.org/#/c/370374/ for bug
1641750 and StarlingX has already fixed it:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/resource_tracker.py#L728
The issue: if "update_resources_interval" is set to some higher value,
say 10 minutes, and an instance is resized and immediately confirmed
(e.g. because "resize_confirm_window" is set to 1 second), then the GET
/os-hypervisors/{hypervisor_id} results for fields like
"local_gb_used", "memory_mb_used" and "vcpus_used" will still show
usage for the old flavor, even though the instance is running on the
dest host with the new flavor and is gone from the source host. That is
not corrected until the update_available_resource periodic task runs on
the source again (or the source nova-compute service is restarted).
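For reference, a reproduction scenario would use values along these lines in nova.conf (both options exist in nova; the exact values here are just the ones from the description):

```ini
[DEFAULT]
# Run the update_available_resource periodic only every 10 minutes,
# widening the window during which usage on the source host is stale.
update_resources_interval = 600
# Auto-confirm resizes almost immediately.
resize_confirm_window = 1
```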
This is because the source compute ResourceTracker is not tracking the
migration in its "tracked_migrations" dict. The resize claim happens on
the dest host, and that is where the migration is "tracked"; the
ResourceTracker on the source tracks the instance in
'tracked_instances' rather than 'tracked_migrations'.
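A minimal sketch of the divergence (hypothetical class and method names, not the actual nova ResourceTracker code): the dest RT records the migration via the resize claim, while the source RT keeps treating the instance as a normally tracked instance.

```python
# Simplified model of the tracking split described above
# (hypothetical names, not the real nova implementation).

class MiniResourceTracker:
    def __init__(self, host):
        self.host = host
        self.tracked_instances = set()
        self.tracked_migrations = {}

    def resize_claim(self, instance_uuid, migration):
        # Only called on the *dest* host during a resize; this is
        # where the migration gets tracked.
        self.tracked_migrations[instance_uuid] = migration

    def update_usage_from_instance(self, instance_uuid, on_this_host):
        # Instances believed to live on this host are tracked as
        # plain instances, not migrations.
        if on_this_host:
            self.tracked_instances.add(instance_uuid)

source = MiniResourceTracker('source')
dest = MiniResourceTracker('dest')

# During the resize, the instance is still "on" the source host...
source.update_usage_from_instance('uuid-1', on_this_host=True)
# ...while the resize claim tracks the migration only on the dest.
dest.resize_claim('uuid-1', migration={'id': 1})

# The source never gets the migration into tracked_migrations.
assert 'uuid-1' in source.tracked_instances
assert 'uuid-1' not in source.tracked_migrations
```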
On the source host, when the RT code is called here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1063
"tracked = uuid in self.tracked_instances" will be True, because the
instance remains on the source until it gets resized to the dest and
the host value changes here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4500
In the RT this means we never get the itype here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1125
so the source RT does not track the migration here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1146
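The skip on the source can be modeled roughly like this (hypothetical function and parameter names; the real logic lives in the RT method linked above). When the instance is already in tracked_instances, no itype is resolved, so the migration is never added to tracked_migrations:

```python
# Rough model of the code path above (hypothetical names, not nova's
# actual _update_usage_from_migration signature).

def update_usage_from_migration(tracked_instances, tracked_migrations,
                                instance_uuid, old_flavor, new_flavor):
    tracked = instance_uuid in tracked_instances
    itype = None
    if not tracked:
        # On the dest host (or once the instance has left the source)
        # we would resolve which flavor to charge the migration for.
        itype = old_flavor
    if itype is not None:
        tracked_migrations[instance_uuid] = (itype, new_flavor)
    # When tracked is True, we fall through without recording the
    # migration -- the source RT case from the bug description.
    return tracked_migrations

migrations = update_usage_from_migration(
    tracked_instances={'uuid-1'}, tracked_migrations={},
    instance_uuid='uuid-1', old_flavor='m1.small', new_flavor='m1.large')
assert migrations == {}  # source RT: migration not tracked
```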
This is important because later, in confirm_resize (on the source
host), when RT.drop_move_claim is called:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4014
it will only update resource usage and decrement the old flavor if the
migration is tracked:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L478
as noted in the TODO in the elif block below:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L489
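A sketch of why drop_move_claim then becomes a no-op on the source (simplified, hypothetical usage accounting; nova's real method takes more arguments and updates more resource classes):

```python
# Simplified model: old-flavor usage is only subtracted when the
# migration is in tracked_migrations, mirroring the check above.

def drop_move_claim(tracked_migrations, usage, instance_uuid, old_flavor):
    if instance_uuid in tracked_migrations:
        usage['vcpus_used'] -= old_flavor['vcpus']
        usage['memory_mb_used'] -= old_flavor['memory_mb']
        del tracked_migrations[instance_uuid]
    # else: the TODO path from the description -- usage stays stale
    # until the next update_available_resource periodic run.
    return usage

old_flavor = {'vcpus': 2, 'memory_mb': 2048}

# Source RT: the migration was never tracked, so nothing is decremented
# and the hypervisor usage keeps showing the old flavor.
stale = drop_move_claim({}, {'vcpus_used': 2, 'memory_mb_used': 2048},
                        'uuid-1', old_flavor)
assert stale == {'vcpus_used': 2, 'memory_mb_used': 2048}
```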
This is fairly low priority given how latent it is, and it is
self-healing: the next run of the update_available_resource periodic
task will fix the resource usage on the source host. But in a busy
cloud it could mean the difference between a server being able to build
on that source host or not, based on its tracked resource usage.
** Affects: nova
Importance: Low
Status: Triaged
** Tags: resize resource-tracker starlingx
** Changed in: nova
Status: New => Triaged
** Tags added: starlingx
https://bugs.launchpad.net/bugs/1818914