yahoo-eng-team team mailing list archive
Message #71711
[Bug 1755602] Re: Ironic computes may not be discovered when node count is less than compute count
** Also affects: tripleo
   Importance: Undecided
   Status: New

** Changed in: tripleo
   Assignee: (unassigned) => Oliver Walsh (owalsh)

** Changed in: tripleo
   Milestone: None => rocky-1

** Changed in: tripleo
   Status: New => In Progress

** Changed in: tripleo
   Importance: Undecided => Medium

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1755602

Title:
  Ironic computes may not be discovered when node count is less than
  compute count

Status in OpenStack Compute (nova):
  In Progress
Status in tripleo:
  In Progress

Bug description:
In an ironic deployment being built from day zero, there is an
ordering problem, which generates a race condition for operators.
Consider this common example:
At config time, you create and start three nova-compute services
pointing at your ironic deployment. These three will be HA using the
ironic driver's hash ring functionality. At config time, there are no
ironic nodes present yet, which means running discover_hosts will
create no host mappings.
Next, a single ironic node is added, which is owned by one of the
computes per the hash rules. At this point, you can run discover_hosts
and whatever compute owns that node will get a host mapping. Then you
add a second ironic node, which causes all three nova-computes to
rebalance the hash ring. One or more of the ironic nodes will
definitely land on one of the other nova-computes and will suddenly be
unreachable because there is no host mapping until the next time
discover_hosts is run. Since we track the "mapped" bit on compute
nodes, and compute nodes move between hosts with ironic, we won't even
notice that the new owner nova-compute needs a host mapping. In fact,
we won't notice until we get lucky enough to land a never-mapped
ironic node on a nova-compute for the first time and then run
discover_hosts after that point.
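
The sequence above can be reproduced with a small toy model. This is only a sketch: ComputeNode, discover_hosts and is_reachable below are simplified stand-ins written for illustration, not nova's actual code, and node ownership is assigned by hand rather than computed by a real hash ring.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    """Illustrative stand-in for a compute node record (not nova's model)."""
    uuid: str
    host: str             # nova-compute service that currently owns the node
    mapped: bool = False  # the per-compute-node "mapped" bit


def discover_hosts(compute_nodes, host_mappings):
    """Toy discovery: only compute nodes with mapped=False are examined."""
    created = []
    for cn in compute_nodes:
        if cn.mapped:
            continue  # already-mapped nodes are skipped, even if re-homed
        if cn.host not in host_mappings:
            host_mappings.add(cn.host)
            created.append(cn.host)
        cn.mapped = True
    return created


def is_reachable(cn, host_mappings):
    """A node is usable only if its current owner has a host mapping."""
    return cn.host in host_mappings


def run_scenario():
    host_mappings = set()
    nodes = []

    # Config time: three nova-compute services exist, but no ironic nodes.
    step0 = discover_hosts(nodes, host_mappings)

    # The first ironic node lands on compute-1 and is discovered.
    node_a = ComputeNode("node-a", host="compute-1")
    nodes.append(node_a)
    step1 = discover_hosts(nodes, host_mappings)

    # A second node arrives. Assume the rebalance moves node-a to compute-2
    # and places node-b on compute-1.
    node_a.host = "compute-2"
    node_b = ComputeNode("node-b", host="compute-1")
    nodes.append(node_b)
    step2 = discover_hosts(nodes, host_mappings)

    return step0, step1, step2, is_reachable(node_a, host_mappings), host_mappings


if __name__ == "__main__":
    print(run_scenario())
```

In this model the last discovery run creates no new mapping: node-b is unmapped but sits on an already-mapped host, while node-a keeps its mapped bit and is skipped, so compute-2 never gets a host mapping and node-a stays unreachable.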
For an automated config management system, this is a lot of complexity
to handle in order to reliably produce a working system. In many cases
where you're using ironic to bootstrap another deployment (e.g.
tripleo), the number of nodes may be small (fewer than the number of
computes) for quite some time.
There are a couple of obvious options I see:
1. Add a --and-services flag to nova-manage, which will also look for
all nova-compute services in the cell and make sure those have
mappings. This is ideal because we could get all services mapped at
config time without even having to have an ironic node in place yet
(which is not possible today). We can't do this efficiently right away
because nova.services does not have a mapped flag, so every run has to
check every service, and thus the scheduler periodic task should _not_
include services.
2. We could unset compute_node.mapped any time we re-home an ironic
node to a different nova-compute. This would cause our scheduler
periodic to notice the change and create a host mapping if it happens
to move to an unmapped nova-compute. This generates extra work during
normal operating state and also still leaves us with an interval of
time where a previously-usable ironic node becomes unusable until the
host discovery periodic task runs again.
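
As a rough illustration of option 2, the sketch below clears the mapped bit whenever a node changes owner. The rehome function and the simplified ComputeNode and discover_hosts are hypothetical names for this example only and do not correspond to existing nova code.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    """Illustrative stand-in for a compute node record (not nova's model)."""
    uuid: str
    host: str
    mapped: bool = False


def rehome(cn, new_host):
    """Hypothetical re-home hook: clear the mapped bit when the owner changes."""
    if cn.host != new_host:
        cn.host = new_host
        cn.mapped = False


def discover_hosts(compute_nodes, host_mappings):
    """Toy periodic discovery: only compute nodes with mapped=False are examined."""
    created = []
    for cn in compute_nodes:
        if cn.mapped:
            continue
        if cn.host not in host_mappings:
            host_mappings.add(cn.host)
            created.append(cn.host)
        cn.mapped = True
    return created


def demo_option_2():
    host_mappings = {"compute-1"}
    node_a = ComputeNode("node-a", host="compute-1", mapped=True)

    # The hash ring rebalances and node-a moves to an unmapped nova-compute.
    rehome(node_a, "compute-2")
    usable_before = node_a.host in host_mappings  # the unusable window

    # The next periodic discovery run notices the cleared bit.
    created = discover_hosts([node_a], host_mappings)
    usable_after = node_a.host in host_mappings
    return usable_before, created, usable_after


if __name__ == "__main__":
    print(demo_option_2())
```

The node is still unusable between the re-home and the next discovery run, which is the interval described in option 2.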
IMHO, we should do #1. It's a backportable change, and it's actually a
better workflow for config automation tools than what we have today,
even discounting this race. We can do what we did before, which is do
it once for backports, and then add a mapped bit in master to make it
more efficient, allowing it to be included in the scheduler periodic
task.
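
A minimal sketch of what option 1 amounts to, assuming discovery is driven by the list of nova-compute services rather than by compute nodes. The function discover_hosts_and_services and the host names are made up for illustration; this is not the nova-manage implementation.

```python
def discover_hosts_and_services(service_hosts, host_mappings):
    """Toy version of the proposed behaviour: map every nova-compute service.

    Services carry no mapped flag in this model, so every service is
    checked on every run.
    """
    created = [host for host in sorted(service_hosts) if host not in host_mappings]
    host_mappings.update(created)
    return created


def demo_option_1():
    # Day zero: three nova-compute services, no ironic nodes at all.
    services = {"compute-1", "compute-2", "compute-3"}
    host_mappings = set()

    first = discover_hosts_and_services(services, host_mappings)
    second = discover_hosts_and_services(services, host_mappings)
    return first, second, host_mappings


if __name__ == "__main__":
    print(demo_option_1())
```

With all three services mapped at config time, any later rebalance lands nodes on a host that already has a mapping, so no further discovery run is needed for them to be usable.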
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1755602/+subscriptions