yahoo-eng-team team mailing list archive
Message #78796
[Bug 1818914] Re: Hypervisor resource usage on source still shows old flavor usage after resize confirm until update_available_resource periodic runs
Reviewed: https://review.opendev.org/641806
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Submitter: Zuul
Branch: master
commit ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Thu Mar 7 16:07:18 2019 -0500
Update usage in RT.drop_move_claim during confirm resize
The confirm resize flow in the compute manager
runs on the source host. It calls RT.drop_move_claim
to drop resource usage from the source host for the
old flavor. The problem with drop_move_claim is that it
only decrements the old flavor from the reported usage
if the instance is in RT.tracked_migrations, which will
only be there on the source host if the update_available_resource
periodic task runs before the resize is confirmed, otherwise
the instance is still just tracked in RT.tracked_instances on
the source host. This leaves the source compute incorrectly
reporting resource usage for the old flavor until the next
periodic runs, which could be a large window if resizes are
configured to automatically confirm, e.g. resize_confirm_window=1,
and the periodic interval is big, e.g. update_resources_interval=600.
This fixes the issue by also updating usage in drop_move_claim
when the instance is not in tracked_migrations but is in
tracked_instances.
Because of the tight coupling with the instance.migration_context
we need to ensure the migration_context still exists before
drop_move_claim is called during confirm_resize, so a test wrinkle
is added to enforce that.
test_drop_move_claim_on_revert also needed updating to
reflect how drop_move_claim is actually called during
revert_resize.
And finally, the functional recreate test is updated to show the
bug is fixed.
Change-Id: Ia6d8a7909081b0b856bd7e290e234af7e42a2b38
Closes-Bug: #1818914
Related-Bug: #1641750
Related-Bug: #1498126
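The fix described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the actual nova ResourceTracker code: only memory is accounted for, and the `_update_usage` helper and `ResourceTrackerSketch` class are hypothetical stand-ins for nova's real usage bookkeeping.

```python
# Simplified sketch of RT.drop_move_claim after the fix. The real
# ResourceTracker tracks many resource classes; memory_mb stands in for
# all of them here, and _update_usage is an illustrative helper only.

class ResourceTrackerSketch:
    def __init__(self):
        self.tracked_migrations = {}    # instance uuid -> migration
        self.tracked_instances = set()  # instance uuids on this host
        self.memory_mb_used = 0

    def _update_usage(self, flavor, sign=1):
        self.memory_mb_used += sign * flavor['memory_mb']

    def drop_move_claim(self, instance_uuid, flavor):
        if instance_uuid in self.tracked_migrations:
            # Source host after the periodic has run: the migration is
            # tracked, so usage for the old flavor is dropped here.
            del self.tracked_migrations[instance_uuid]
            self._update_usage(flavor, sign=-1)
        elif instance_uuid in self.tracked_instances:
            # The fix: also decrement usage when the instance is only in
            # tracked_instances, i.e. the resize was confirmed before the
            # update_available_resource periodic task ran on the source.
            self.tracked_instances.discard(instance_uuid)
            self._update_usage(flavor, sign=-1)
```

With the elif branch in place, confirming a resize immediately after it completes no longer leaves the old flavor's usage counted on the source host.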
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1818914
Title:
Hypervisor resource usage on source still shows old flavor usage after
resize confirm until update_available_resource periodic runs
Status in OpenStack Compute (nova):
Fix Released
Bug description:
I actually uncovered this due to some failing functional tests for
cross-cell resize:
https://review.openstack.org/#/c/641176/2/nova/compute/resource_tracker.py@503
But this goes back to https://review.openstack.org/#/c/370374/ for bug
1641750 and StarlingX has already fixed it:
https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/resource_tracker.py#L728
The issue is that if you set the "update_resources_interval" to some
higher value, let's say 10 minutes, and resize an instance and
immediately confirm it, because let's say "resize_confirm_window" is
set to 1 second, then the GET /os-hypervisors/{hypervisor_id} results
for things like "local_gb_used", "memory_mb_used" and "vcpus_used"
will still show usage for the old flavor even though the instance is
running on the dest host with the new flavor and is gone from the
source host. That doesn't get fixed until the
update_available_resource periodic task runs on the source again (or
the source nova-compute service is restarted).
This is because the source compute resource tracker is not tracking
the migration in its "tracked_migrations" dict. The resize claim
happens on the dest host and that's where the migration is "tracked".
The ResourceTracker on the source is tracking the instance in
'tracked_instances' rather than 'tracked_migrations'.
On the source host when the RT code is called here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1063
"tracked = uuid in self.tracked_instances" will be True because the
instance is on the source until it gets resized to the dest and then
the host value changes here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4500
But in the RT this means we won't get the itype here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1125
So the source RT doesn't track the migration here:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1146
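The code walk above can be condensed into a small sketch. The function name and return values below are illustrative, not nova's actual signature; the point is only the `tracked` check that short-circuits migration tracking on the source host.

```python
# Sketch of why the source RT never records the resize migration: if the
# instance is already tracked on this host, the RT skips the migration
# path entirely, so it never lands in tracked_migrations.

def updates_from_migration(instance_uuid, tracked_instances):
    tracked = instance_uuid in tracked_instances
    if tracked:
        # Source host during a resize: instance.host still points here
        # until finish_resize runs on the dest, so tracked is True and
        # the migration stays out of tracked_migrations.
        return 'skip-migration'
    return 'track-migration'
```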
This is important because later in confirm_resize (on the source host)
when it calls RT.drop_move_claim:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4014
That will only update resource usage and decrement the old flavor if
it's a tracked migration:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L478
As noted from the TODO in the elif block below:
https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L489
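The pre-fix behavior of that elif block can be sketched as below. This is a simplified stand-in, not the real nova code: usage is reduced to a single dict entry, and the function shape is hypothetical.

```python
# Sketch of drop_move_claim BEFORE the fix: usage is only decremented
# when the migration was tracked, which on the source host requires the
# update_available_resource periodic to have run before the confirm.

def drop_move_claim_before_fix(tracked_migrations, tracked_instances,
                               usage, instance_uuid, flavor):
    if instance_uuid in tracked_migrations:
        tracked_migrations.pop(instance_uuid)
        usage['memory_mb_used'] -= flavor['memory_mb']
    elif instance_uuid in tracked_instances:
        # The TODO branch from the report: the instance is dropped from
        # tracking, but the old flavor's usage is NOT decremented, so the
        # hypervisor keeps reporting it until the periodic runs again.
        tracked_instances.discard(instance_uuid)
```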
This is semi-low priority given how latent it is and the fact that it
is self-healing: the next run of the update_available_resource
periodic will fix the resource usage on the source host. In a busy
cloud, though, it could mean the difference between a server being
able to build on that source host or not, based on its tracked
resource usage.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1818914/+subscriptions