yahoo-eng-team team mailing list archive
Message #93326
[Bug 2043036] Re: [ironic] list_instances/list_instance_uuid does not respect conductor_group/partition_key
Reviewed: https://review.opendev.org/c/openstack/nova/+/900831
Committed: https://opendev.org/openstack/nova/commit/fa3cf7d50cba921ea67eb161e6a199067ea62deb
Submitter: "Zuul (22348)"
Branch: master
commit fa3cf7d50cba921ea67eb161e6a199067ea62deb
Author: Jay Faulkner <jay@xxxxxx>
Date: Mon Nov 13 15:21:31 2023 -0800
[ironic] Partition & use cache for list_instance*
list_instances and list_instance_uuids, as written in the Ironic driver,
do not currently respect conductor_group partitioning. Given that a nova
compute is intended to limit its scope of work to the conductor group
it is configured to work with, this is a bug.
Additionally, this should be a significant performance boost for a
couple of reasons. First, instead of calling the Ironic API and
getting all nodes, we now properly fetch only the subset belonging to
the configured conductor group, which is the optimized path in the
Ironic DB and API code. Second, we now use the driver's node cache to
respond to these requests. Since list_instances and
list_instance_uuids are used by periodic tasks, operating on slightly
stale data should have minimal impact compared to the performance
benefits.
Closes-bug: #2043036
Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5
** Changed in: nova
Status: In Progress => Fix Released
--
https://bugs.launchpad.net/bugs/2043036
Title:
[ironic] list_instances/list_instance_uuid does not respect
conductor_group/partition_key
Status in Ironic:
Triaged
Status in OpenStack Compute (nova):
Fix Released
Bug description:
The Ironic driver methods list_instances and list_instance_uuids do
not currently respect the conductor_group option:
https://opendev.org/openstack/nova/src/branch/master/nova/conf/ironic.py#L71.
This leads to significant performance degradation, as querying Ironic
for all nodes (/v1/nodes) instead of all nodes managed by the compute
(/v1/nodes?conductor_group=blah) is a significantly more expensive API
call.
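The cost difference comes down to asking Ironic to filter server-side via the conductor_group query parameter rather than returning every node. A minimal sketch of the two query shapes described above; the endpoint paths come straight from the bug report, while the helper name and base URL are illustrative assumptions, not nova code:

```python
from urllib.parse import urlencode


def nodes_url(base, conductor_group=None):
    """Build an Ironic node-listing URL.

    Passing conductor_group makes Ironic filter server-side, so the
    compute only receives (and pays for) the nodes it manages.
    """
    url = f"{base}/v1/nodes"
    if conductor_group:
        url += "?" + urlencode({"conductor_group": conductor_group})
    return url


# Expensive: every node known to Ironic.
print(nodes_url("http://ironic:6385"))
# Cheaper: only this compute's subset.
print(nodes_url("http://ironic:6385", "blah"))
```

With a large deployment split into many conductor groups, the unfiltered call returns the full node list to every compute, so the difference grows with fleet size.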
In addition, this can lead to unexpected behavior for operators, such
as an action being taken by a compute serving conductor group "A" to
resolve an issue that would normally be resolved by a compute serving
conductor group "B".
While troubleshooting this error, we dug deeply into what this data is used for; it's used for two things:
- Reconciling deleted instances as a periodic job
- Ensuring no instances exist on a newly-started compute host
These are tasks that can either tolerate stale data or would not be impacted by using the Ironic driver's existing node cache. Therefore, a suggested fix is:
Revise list_instances and list_instance_uuids to reuse the node cache
to reduce the overall API calls being made to Ironic, and ensure all
/v1/nodes calls use the same codepath in the Ironic driver. It's the
belief of JayF, TheJulia, and Johnthetubaguy (on a video call right
now) that using stale data, without refreshing the cache, should be
safe for these use cases. (Even if we decide to refresh the cache, we
should use this code path anyway.)
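The suggested fix can be sketched as answering both methods from the driver's node cache. This is a hypothetical illustration, not nova's actual driver code: the FakeNode class, the node_cache dict, and the instance_info field layout are all assumptions made for the example.

```python
class FakeNode:
    """Stand-in for a cached Ironic node record (illustrative only)."""

    def __init__(self, uuid, instance_uuid, instance_name=None):
        self.uuid = uuid
        self.instance_uuid = instance_uuid  # None if no instance on node
        self.instance_info = {"display_name": instance_name}


class IronicDriverSketch:
    def __init__(self, node_cache):
        # The cache is assumed to already hold only nodes for this
        # compute's conductor_group, because the cache refresh queries
        # /v1/nodes?conductor_group=... server-side.
        self.node_cache = node_cache

    def list_instance_uuids(self):
        # No Ironic API call: answer from the (possibly slightly
        # stale) cache, which the bug discussion deems safe for the
        # periodic tasks that consume this data.
        return [n.instance_uuid for n in self.node_cache.values()
                if n.instance_uuid]

    def list_instances(self):
        return [n.instance_info["display_name"]
                for n in self.node_cache.values() if n.instance_uuid]
```

Because both callers are periodic reconciliation tasks, serving them slightly stale cache entries trades a small freshness window for eliminating a full node-list API round trip per invocation.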
To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2043036/+subscriptions