yahoo-eng-team team mailing list archive
Message #96401
[Bug 2122036] Re: /os-hypervisors/detail takes too long to complete for 2.88 microversion
Reviewed: https://review.opendev.org/c/openstack/nova/+/959604
Committed: https://opendev.org/openstack/nova/commit/567dbe1867602d544945b3584c3885ac146b6535
Submitter: "Zuul (22348)"
Branch: master
commit 567dbe1867602d544945b3584c3885ac146b6535
Author: Sean Mooney <work@xxxxxxxxxxxxxxx>
Date: Thu Sep 4 21:42:04 2025 +0100
hypervisors: Optimize uptime retrieval for better performance
The /os-hypervisors/detail API endpoint was experiencing significant
performance issues in environments with many compute nodes when using
microversion 2.88 or higher, as it made sequential RPC calls to gather
uptime information from each compute node.
This change optimizes uptime retrieval by:
* Adding uptime to periodic resource updates sent by nova-compute to the
database, eliminating synchronous RPC calls during API requests
* Restricting RPC-based uptime retrieval to hypervisor types that support
it (libvirt and z/VM), avoiding unnecessary calls that would always fail
* Preferring cached database uptime data over RPC calls when available
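The selection order described in the commit message can be sketched roughly as below. This is a minimal illustration, not nova's actual code. The following are all hypothetical stand-ins:

- the `ComputeNode` dataclass
- the `"uptime"` key inside `stats`
- the `rpc_get_uptime` callable
- the driver names in `RPC_UPTIME_DRIVERS`

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical driver names standing in for "libvirt and z/VM", the only
# computes assumed here to answer an uptime RPC.
RPC_UPTIME_DRIVERS = frozenset({"libvirt", "zvm"})


@dataclass
class ComputeNode:
    """Hypothetical, simplified stand-in for a compute_nodes row."""
    host: str
    driver: str
    stats: dict = field(default_factory=dict)


def get_uptime(node: ComputeNode,
               rpc_get_uptime: Callable[[str], str]) -> Optional[str]:
    """Return uptime, preferring data cached in the DB over an RPC call."""
    # 1. Cheap path: uptime already stored by the periodic resource update.
    cached = node.stats.get("uptime")
    if cached is not None:
        return cached
    # 2. Only drivers assumed to support it get a (slow, blocking) RPC call.
    if node.driver in RPC_UPTIME_DRIVERS:
        return rpc_get_uptime(node.host)
    # 3. Any other driver would always fail, so the RPC is skipped entirely.
    return None
```

Checking the DB copy first means that once every compute reports uptime periodically, the API request issues no RPC at all. The driver allow-list only matters for nodes that have not reported yet.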
Closes-Bug: #2122036
Assisted-By: Claude <noreply@xxxxxxxxxxxxx>
Change-Id: I5723320f578192f7e0beead7d5df5d7e47d54d2b
Co-Authored-By: Sylvain Bauza <sbauza@xxxxxxxxxx>
Signed-off-by: Sean Mooney <work@xxxxxxxxxxxxxxx>
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2122036
Title:
/os-hypervisors/detail takes too long to complete for 2.88
microversion
Status in OpenStack Compute (nova):
Fix Released
Bug description:
To Reproduce
Steps to reproduce the behavior:
In an Antelope environment with a huge number of compute nodes, run the "openstack hypervisor list" command. It can take more than 40 seconds to complete and produce any output.
Expected behavior
The command completes quickly by default; extra delays are expected only when the operator explicitly asks for extra data.
Bug impact
May prevent the command from completing within default timeouts (the request fails before the API responds, because HAProxy returns a 504). Also, time-consuming options should probably not be enabled by default.
Known workaround
Specify an earlier API version (2.68, for example).
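For reference, the microversion can be pinned in two ways:

- With the OpenStack client, use the global `--os-compute-api-version` option, for example `openstack --os-compute-api-version 2.68 hypervisor list`.
- On raw HTTP requests, set the `OpenStack-API-Version: compute <version>` header.

The small sketch below only builds headers and checks whether a version falls in the uptime-reporting range described in this bug. The helper names are hypothetical, and no request is sent.

```python
# Per this bug report, /os-hypervisors/detail includes uptime from 2.88 on.
UPTIME_MIN_MICROVERSION = (2, 88)


def parse_microversion(version: str) -> tuple[int, int]:
    """Turn "2.68" into (2, 68) so versions compare numerically, not as text."""
    major, minor = version.split(".")
    return int(major), int(minor)


def includes_uptime(version: str) -> bool:
    """True if detail responses at this microversion carry uptime."""
    return parse_microversion(version) >= UPTIME_MIN_MICROVERSION


def compute_api_headers(token: str, version: str = "2.68") -> dict[str, str]:
    """Hypothetical helper: request headers that pin the compute microversion."""
    return {
        "X-Auth-Token": token,
        "OpenStack-API-Version": f"compute {version}",
    }
```

Parsing into integer tuples matters because plain string comparison would rank "2.100" below "2.88".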
---
There is another, independent case that can cause slowness. The uptime
RPC is only called on computes that are considered up. However, if a
compute is down but the conductor has not yet detected this, because
the missing heartbeat has not been noticed yet, the RPC is still sent
and never answered, causing an unnecessary delay in the API response.
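A simplified model shows why such a compute still receives the RPC. Assume a heartbeat-based liveness check in which a service counts as up while its last heartbeat is no older than a fixed threshold. Nova's `service_down_time` option plays this role; 60 seconds is assumed here. Under that check, a compute that has just died keeps looking up for almost the whole threshold. The function is illustrative only, not nova's servicegroup implementation.

```python
from datetime import datetime, timedelta, timezone

SERVICE_DOWN_TIME = timedelta(seconds=60)  # assumed threshold, see lead-in


def is_considered_up(last_heartbeat: datetime, now: datetime,
                     down_time: timedelta = SERVICE_DOWN_TIME) -> bool:
    """A service counts as up while its last heartbeat is recent enough."""
    return now - last_heartbeat <= down_time


# Suppose a compute crashed right after sending its last heartbeat at t0.
t0 = datetime(2025, 9, 4, 12, 0, 0, tzinfo=timezone.utc)
# 45 s later it still looks up, so an uptime RPC would be sent to it and
# the API request would wait for that RPC to time out.
print(is_considered_up(t0, t0 + timedelta(seconds=45)))  # True
# Once the threshold has passed, it is treated as down and skipped.
print(is_considered_up(t0, t0 + timedelta(seconds=75)))  # False
```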
---
The slowness is because, from microversion 2.88, /os-hypervisors/detail
includes the compute uptime, and nova gathers it by making an RPC call
down to each compute sequentially.
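Because the calls are made one after another, their costs add up rather than overlap. The back-of-the-envelope model below makes this concrete. The inputs are hypothetical, not measurements from this bug:

- the per-call latencies
- the compute count
- the timeout value

`None` marks a compute that never answers (the case described in the previous comment), which costs a full RPC timeout.

```python
from typing import Optional, Sequence


def sequential_uptime_cost(latencies: Sequence[Optional[float]],
                           rpc_timeout: float) -> float:
    """Total seconds spent when uptime RPCs are issued one at a time.

    Each entry is one compute's reply latency in seconds; None means the
    compute never replies, so the caller waits for the full timeout.
    """
    return sum(rpc_timeout if lat is None else lat for lat in latencies)


# Illustrative numbers only: 500 healthy computes answering in 50 ms each,
# optionally plus one silently dead compute, with an assumed 60 s timeout.
healthy = [0.05] * 500
print(round(sequential_uptime_cost(healthy, rpc_timeout=60.0), 2))           # 25.0
print(round(sequential_uptime_cost(healthy + [None], rpc_timeout=60.0), 2))  # 85.0
```

The cost grows linearly with the number of computes. Reading a value already stored in the database removes this per-compute RPC term entirely.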
As a workaround, an older microversion, in which uptime is not part of
that API, should be used.
As a future mitigation, we should implement a periodic task in
nova-compute that periodically reports the uptime into the
compute_nodes.stats JSON blob in the cell DB, as part of a new service
version. The API should then make the RPC call down to the compute only
if the service version is old. If the service version is new enough,
the API can use the data directly from the DB.
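The proposal can be sketched in two halves:

- The compute side merges the uptime into the `stats` JSON blob during its periodic update.
- The API side reads the value from the DB when the service version is new enough, and falls back to RPC otherwise.

Everything here is hypothetical and for illustration only, not nova's code:

- the `"uptime"` key
- the `MIN_VERSION_UPTIME_IN_STATS` constant and its value
- the function names

```python
import json
from typing import Callable, Optional

# Hypothetical first service version whose computes report uptime in stats.
MIN_VERSION_UPTIME_IN_STATS = 70


def update_stats_blob(stats_json: Optional[str], uptime: str) -> str:
    """Compute side: merge the current uptime into the stats JSON blob."""
    stats = json.loads(stats_json) if stats_json else {}
    stats["uptime"] = uptime
    return json.dumps(stats)


def read_uptime(stats_json: Optional[str], service_version: int,
                rpc_get_uptime: Callable[[], str]) -> Optional[str]:
    """API side: DB copy for new-enough computes, RPC for old ones."""
    if service_version >= MIN_VERSION_UPTIME_IN_STATS:
        stats = json.loads(stats_json) if stats_json else {}
        # None until the compute's first periodic update has run.
        return stats.get("uptime")
    return rpc_get_uptime()
```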
If we don't introduce a service version but instead use the presence
of the field in the JSON blob as the condition, then we can probably
make the feature backportable.