
yahoo-eng-team team mailing list archive

[Bug 2122036] Re: /os-hypervisors/detail takes too long to complete for 2.88 microversion

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/959604
Committed: https://opendev.org/openstack/nova/commit/567dbe1867602d544945b3584c3885ac146b6535
Submitter: "Zuul (22348)"
Branch:    master

commit 567dbe1867602d544945b3584c3885ac146b6535
Author: Sean Mooney <work@xxxxxxxxxxxxxxx>
Date:   Thu Sep 4 21:42:04 2025 +0100

    hypervisors: Optimize uptime retrieval for better performance
    
    The /os-hypervisors/detail API endpoint was experiencing significant
    performance issues in environments with many compute nodes when using
    microversion 2.88 or higher, as it made sequential RPC calls to gather
    uptime information from each compute node.
    
    This change optimizes uptime retrieval by:
    
    * Adding uptime to periodic resource updates sent by nova-compute to the
      database, eliminating synchronous RPC calls during API requests
    * Restricting RPC-based uptime retrieval to hypervisor types that support
      it (libvirt and z/VM), avoiding unnecessary calls that would always fail
    * Preferring cached database uptime data over RPC calls when available
    
    Closes-Bug: #2122036
    Assisted-By: Claude <noreply@xxxxxxxxxxxxx>
    Change-Id: I5723320f578192f7e0beead7d5df5d7e47d54d2b
    Co-Authored-By: Sylvain Bauza <sbauza@xxxxxxxxxx>
    Signed-off-by: Sean Mooney <work@xxxxxxxxxxxxxxx>


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2122036

Title:
  /os-hypervisors/detail takes too long to complete for 2.88
  microversion

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Steps to reproduce
  In an Antelope environment with a large number of compute nodes, run the "openstack hypervisor list" command. It can take more than 40 seconds to complete and produce output.

  Expected behavior
  The command completes quickly by default; extra delay is expected only when the operator explicitly asks for extra data.

  Bug impact
  The command may never complete with default timeouts: it fails first, because HAProxy returns a 504 before the API responds. Also, time-consuming options should probably not be enabled by default.

  Known workaround
  Specify an earlier API version (2.68, for example).
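
  For direct REST calls, the workaround amounts to pinning the compute
  microversion on each request. The following is a minimal sketch, not
  part of the fix: the helper names are hypothetical, while the two
  header names are the ones the Nova API accepts for microversion
  negotiation. The 2.88 threshold is taken from this report.

```python
def includes_uptime(version):
    """Return True if the microversion falls in the range this bug affects."""
    major, minor = (int(part) for part in version.split("."))
    return (major, minor) >= (2, 88)


def microversion_headers(version="2.68"):
    """Build request headers that pin the Nova API microversion.

    Hypothetical helper for illustration. Both the generic and the legacy
    Nova-specific header are set so that older deployments honor the pin.
    """
    return {
        "OpenStack-API-Version": "compute " + version,
        "X-OpenStack-Nova-API-Version": version,
    }
```

  A client would pass these headers with its GET request to
  /os-hypervisors/detail and could use includes_uptime() to warn when a
  requested version falls in the affected range.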

  ---

  There is another, independent case that can cause slowness. The uptime
  RPC is only called on computes that are considered up. However, if a
  compute is down and the conductor has not yet detected this because of
  the missing heartbeat, the RPC is sent but never answered, causing
  unnecessary delay in the API response.

  ---

  The slowness occurs because the 2.88 /os-hypervisors/detail response
  includes the compute uptime, and nova gathers it by making an RPC call
  down to each compute sequentially.

  An older microversion, where uptime is not part of that API, should be
  used as a workaround.

  As a future mitigation we should implement a periodic task in
  nova-compute that reports the uptime to the compute_nodes.stats JSON
  blob in the cell DB, introduced with a new service version. The API
  would then make the RPC call down to the compute only if the service
  version is old. If the service version is new enough, the API can use
  the data directly from the DB.

  If we don't introduce a service version but instead use the existence
  of the field in the json blob as a condition then we can probably make
  the feature backportable.
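
  The field-existence condition described above can be sketched as
  follows. This is an illustration under stated assumptions, not the
  code that was merged: the function name, the "uptime" key, and the
  callable standing in for the RPC are all hypothetical.

```python
import json


def resolve_uptime(stats_json, rpc_get_uptime):
    """Prefer uptime cached in the stats blob; fall back to RPC otherwise.

    stats_json is the raw JSON text of the stats blob (or None), and
    rpc_get_uptime is a zero-argument callable standing in for the RPC
    call to the compute node. Both names are assumptions for this sketch.
    """
    stats = json.loads(stats_json) if stats_json else {}
    if "uptime" in stats:
        # The compute already reported uptime periodically: no RPC needed.
        return stats["uptime"]
    try:
        # Older compute: the blob has no uptime key, so ask the node.
        return rpc_get_uptime()
    except (TimeoutError, NotImplementedError):
        # Unreachable host or unsupported driver: report no uptime
        # instead of failing the whole API response.
        return None
```

  Because the check looks only at the blob contents, it needs no new
  service version, which is what would make this approach a candidate
  for backporting.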

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2122036/+subscriptions


