yahoo-eng-team team mailing list archive: Message #71767
[Bug 1755602] Re: Ironic computes may not be discovered when node count is less than compute count
Reviewed: https://review.openstack.org/552691
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=005a66d7e0bb716e32d29a6b5c9d9f24192596e2
Submitter: Zuul
Branch: master
commit 005a66d7e0bb716e32d29a6b5c9d9f24192596e2
Author: Dan Smith <dansmith@xxxxxxxxxx>
Date: Tue Mar 13 14:42:09 2018 -0700
Add --by-service to discover_hosts
This allows us to discover and map compute hosts by service instead of
by compute node, which will solve a major deployment ordering problem for
people using ironic. This also allows closing a really nasty race when
doing HA of nova-compute/ironic.
Change-Id: Ie9f064cb9caf6dcba2414acb24d12b825df45fab
Closes-Bug: #1755602
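For deployers, the new flag is used with nova-manage's host discovery command. The exact invocation below is an assumption based on the commit message and the cell_v2 discover_hosts command; check the nova-manage documentation for your release:

    nova-manage cell_v2 discover_hosts --by-service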
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1755602
Title:
Ironic computes may not be discovered when node count is less than
compute count
Status in OpenStack Compute (nova):
Fix Released
Status in tripleo:
In Progress
Bug description:
In an ironic deployment being built from day zero, there is an
ordering problem, which generates a race condition for operators.
Consider this common example:
At config time, you create and start three nova-compute services
pointing at your ironic deployment. These three will be HA using the
ironic driver's hash ring functionality. At config time, there are no
ironic nodes present yet, which means running discover_hosts will
create no host mappings.
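As a rough illustration of that ownership model, here is a toy hash-ring sketch in Python. It is not nova's actual ironic HashRing implementation; the hostnames and the single ring point per compute are simplifications for brevity:

    import bisect
    import hashlib

    def _hash(value):
        # Stable integer hash for ring placement.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def build_ring(computes):
        # One point per compute for simplicity; a real ring uses many
        # replicas per host.
        return sorted((_hash(c), c) for c in computes)

    def owner(ring, node_uuid):
        # The compute whose ring position follows the node's hash owns it.
        keys = [k for k, _ in ring]
        idx = bisect.bisect(keys, _hash(node_uuid)) % len(ring)
        return ring[idx][1]

    ring = build_ring(['compute1', 'compute2', 'compute3'])
    print(owner(ring, 'ironic-node-1'))  # whichever compute it hashes to
    print(owner(ring, 'ironic-node-2'))  # possibly a different compute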
Next, a single ironic node is added, which is owned by one of the
computes per the hash rules. At this point, you can run discover_hosts
and whatever compute owns that node will get a host mapping. Then you
add a second ironic node, which causes all three nova-computes to
rebalance the hash ring. One or more of the ironic nodes will
definitely land on one of the other nova-computes and will suddenly be
unreachable because there is no host mapping until the next time
discover_hosts is run. Since we track the "mapped" bit on compute
nodes, and compute nodes move between hosts with ironic, we won't even
notice that the new owner nova-compute needs a host mapping. In fact,
we won't notice until we get lucky enough to land a never-mapped
ironic node on a nova-compute for the first time and then run
discover_hosts after that point.
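A minimal, self-contained sketch of why the rebalanced node is missed (illustrative Python only; the class, set, and function names are stand-ins, not nova's actual discover_hosts code):

    class ComputeNode(object):
        def __init__(self, uuid, host):
            self.uuid = uuid
            self.host = host        # nova-compute service that currently owns it
            self.mapped = False     # the per-compute-node "mapped" bit

    host_mappings = set()           # hosts that have a HostMapping record

    def discover_hosts(compute_nodes):
        for node in compute_nodes:
            if node.mapped:
                # Already-mapped nodes are skipped, even if the hash ring
                # has since rehomed them to a compute with no HostMapping.
                continue
            host_mappings.add(node.host)
            node.mapped = True

    node = ComputeNode('ironic-node-1', host='compute1')
    discover_hosts([node])     # compute1 gets a host mapping
    node.host = 'compute2'     # hash ring rebalance rehomes the node
    discover_hosts([node])     # no-op: mapped is already True, so compute2
                               # never gets a host mapping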
For an automated config management system, this is a lot of complexity
to handle in order to generate a stable output of a working system. In
many cases where you're using ironic to bootstrap another deployment
(e.g. tripleo), the number of nodes may be small (fewer than the number
of computes) for quite some time.
There are a couple obvious options I see:
1. Add a --and-services flag to nova-manage, which will also look for
all nova-compute services in the cell and make sure those have
mappings. This is ideal because we could get all services mapped at
config time without even having to have an ironic node in place yet
(which is not possible today). We can't do this efficiently right away
because nova.services does not have a mapped flag, and thus the
scheduler periodic should _not_ include services.
2. We could unset compute_node.mapped any time we re-home an ironic
node to a different nova-compute. This would cause our scheduler
periodic to notice the change and create a host mapping if it happens
to move to an unmapped nova-compute. This generates extra work during
normal operating state and also still leaves us with an interval of
time where a previously-usable ironic node becomes unusable until the
host discovery periodic task runs again.
IMHO, we should do #1. It's a backportable change, and it's actually a
better workflow for config automation tools than what we have today,
even discounting this race. We can do what we did before, which is do
it once for backports, and then add a mapped bit in master to make it
more efficient, allowing it to be included in the scheduler periodic
task.
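To make the difference concrete, here is a sketch of what mapping by service (option #1, and essentially what the --by-service flag ends up doing) looks like next to the node-based loop above. Again, this is an illustrative stand-in, not nova's implementation:

    host_mappings = set()

    def discover_hosts_by_service(services):
        # Map every nova-compute service in the cell, regardless of whether
        # any ironic node currently hashes to it.
        for svc in services:
            if svc not in host_mappings:
                host_mappings.add(svc)

    discover_hosts_by_service(['compute1', 'compute2', 'compute3'])
    # All three computes are mapped at config time, before a single ironic
    # node exists, so later hash ring rebalances never land a node on an
    # unmapped compute.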
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1755602/+subscriptions