yahoo-eng-team team mailing list archive

[Bug 1755602] Re: Ironic computes may not be discovered when node count is less than compute count

 

** Changed in: tripleo
       Status: In Progress => Fix Released

** Changed in: nova/queens
       Status: Fix Committed => Fix Released

** Changed in: nova/pike
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1755602

Title:
  Ironic computes may not be discovered when node count is less than
  compute count

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Fix Released
Status in tripleo:
  Fix Released

Bug description:
  In an ironic deployment being built from day zero, there is an
  ordering problem, which generates a race condition for operators.
  Consider this common example:

  At config time, you create and start three nova-compute services
  pointing at your ironic deployment. These three will be HA using the
  ironic driver's hash ring functionality. At config time, there are no
  ironic nodes present yet, which means running discover_hosts will
  create no host mappings.

  Next, a single ironic node is added, which is owned by one of the
  computes per the hash rules. At this point, you can run discover_hosts
  and whatever compute owns that node will get a host mapping. Then you
  add a second ironic node, which causes all three nova-computes to
  rebalance the hash ring. One or more of the ironic nodes will
  definitely land on one of the other nova-computes and will suddenly be
  unreachable because there is no host mapping until the next time
  discover_hosts is run. Since we track the "mapped" bit on compute
  nodes, and compute nodes move between hosts with ironic, we won't even
  notice that the new owner nova-compute needs a host mapping. In fact,
  we won't notice until we get lucky enough to land a never-mapped
  ironic node on a nova-compute for the first time and then run
  discover_hosts after that point.
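
  The sequence above can be reproduced with a small toy model. This is
  only an illustrative sketch, not nova code: the Cell and ComputeNode
  classes, their methods, and the compute-a/node-1 names are all
  hypothetical, and the hash ring rebalance is simulated by moving a
  node explicitly rather than by real hashing. The one assumption taken
  from the description is that discovery only looks at compute nodes
  whose "mapped" bit is unset.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Toy stand-in for a compute node record backed by one ironic node."""
    name: str
    host: str             # nova-compute service that currently owns the node
    mapped: bool = False  # the "mapped" bit tracked per compute node


@dataclass
class Cell:
    """Hypothetical model of one cell: services, nodes, and host mappings."""
    services: list
    nodes: dict = field(default_factory=dict)
    host_mappings: set = field(default_factory=set)

    def add_node(self, name, host):
        self.nodes[name] = ComputeNode(name, host)

    def rehome(self, name, new_host):
        # Simulated hash ring rebalance: the owner changes, but the
        # mapped bit is left untouched.
        self.nodes[name].host = new_host

    def discover_hosts(self):
        # Only never-mapped compute nodes are examined.
        for node in self.nodes.values():
            if not node.mapped:
                self.host_mappings.add(node.host)
                node.mapped = True

    def unreachable_nodes(self):
        return sorted(
            n.name for n in self.nodes.values()
            if n.host not in self.host_mappings
        )


cell = Cell(services=["compute-a", "compute-b", "compute-c"])
cell.discover_hosts()                  # no nodes yet, so no host mappings
cell.add_node("node-1", "compute-a")
cell.discover_hosts()                  # compute-a gets a host mapping
cell.add_node("node-2", "compute-a")   # second node arrives...
cell.rehome("node-1", "compute-b")     # ...and the rebalance moves node-1
cell.discover_hosts()                  # node-1 is already "mapped": skipped
print(cell.unreachable_nodes())        # ['node-1']
```

  In this model node-1 stays unreachable no matter how often discovery
  runs, until some never-mapped node happens to land on compute-b.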

  For an automated config management system, this is a lot of complexity
  to handle just to reliably produce a working system. In
  many cases where you're using ironic to bootstrap another deployment
  (e.g. tripleo), the number of nodes may be small (fewer than the
  number of computes) for quite some time.

  There are a couple of obvious options I see:

  1. Add a --and-services flag to nova-manage, which will also look for
  all nova-compute services in the cell and make sure those have
  mappings. This is ideal because we could get all services mapped at
  config time without even having to have an ironic node in place yet
  (which is not possible today). We can't do this efficiently right away
  because nova.services does not have a mapped flag, so for now the
  scheduler periodic task should _not_ include services.

  2. We could unset compute_node.mapped any time we re-home an ironic
  node to a different nova-compute. This would cause our scheduler
  periodic to notice the change and create a host mapping if it happens
  to move to an unmapped nova-compute. This generates extra work during
  normal operating state and also still leaves us with an interval of
  time where a previously-usable ironic node becomes unusable until the
  host discovery periodic task runs again.
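
  To make the difference between the two options concrete, here is a
  rough sketch of each. It is not nova code: the function names, the
  plain dict used for a compute node, and the set used for host mappings
  are all hypothetical stand-ins chosen for illustration.

```python
def discover_hosts_by_service(service_hosts, host_mappings):
    """Option 1 sketch: map every nova-compute service in the cell,
    whether or not it owns any ironic node yet."""
    newly_mapped = [h for h in service_hosts if h not in host_mappings]
    host_mappings.update(newly_mapped)
    return newly_mapped


def rehome_node(node, new_host):
    """Option 2 sketch: clear the mapped bit whenever a node changes
    owner, so the next discovery pass re-examines it."""
    if node["host"] != new_host:
        node["host"] = new_host
        node["mapped"] = False
    return node


# Option 1: all three services are mapped at config time, with no nodes.
mappings = set()
discover_hosts_by_service(["compute-a", "compute-b", "compute-c"], mappings)

# Option 2: the moved node becomes a discovery candidate again, but it
# stays unusable until discovery actually runs.
node = {"name": "node-1", "host": "compute-a", "mapped": True}
rehome_node(node, "compute-b")
```

  With the first sketch there is nothing left to discover when a node
  later moves between already-mapped services. With the second, the
  window between the move and the next discovery run remains.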

  IMHO, we should do #1. It's a backportable change, and it's actually a
  better workflow for config automation tools than what we have today,
  even discounting this race. We can do what we did before: implement it
  once in a backportable form, and then add a mapped bit in master to
  make it more efficient, allowing it to be included in the scheduler
  periodic task.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1755602/+subscriptions

