
yahoo-eng-team team mailing list archive

[Bug 1974070] Re: Ironic builds fail when landing on a cleaning node, it doesn't try to reschedule

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/864773
Committed: https://opendev.org/openstack/nova/commit/3c022e968375c1b2eadf3c2dd7190b9434c6d4c1
Submitter: "Zuul (22348)"
Branch:    master

commit 3c022e968375c1b2eadf3c2dd7190b9434c6d4c1
Author: John Garbutt <john.garbutt@xxxxxxxxxxxx>
Date:   Wed Nov 16 17:12:40 2022 +0000

    Ironic nodes with instance reserved in placement
    
    Currently, when you delete an ironic instance, we trigger
    an undeploy in ironic and we release our allocation in placement.
    We do this well before the ironic node is actually available.
    
    We have attempted to fix this by marking unavailable nodes
    as reserved in placement. This works great until you try
    to re-image lots of nodes.
    
    It turns out that ironic nodes that are waiting for their automatic
    clean to finish are returned as valid allocation candidates
    for quite some time. Eventually we mark them as reserved.
    
    This patch takes a strange approach: if we mark all nodes as
    reserved as soon as the instance lands, we close the race.
    That is, when the allocation is removed the node is still
    unavailable until the next update of placement is done and
    notices that the node has become available. That may or may
    not have been after automatic cleaning. The trade-off is
    that when you don't have automatic cleaning, we wait a bit
    longer to notice the node is available again.
    
    Note, this is also useful when a broken Ironic node is
    marked as in maintenance while it is in use by a nova
    instance. In a similar way, we mark the node as reserved
    immediately, rather than first waiting for the instance to be
    deleted before reserving the resources in Placement.
    
    Closes-Bug: #1974070
    Change-Id: Iab92124b5776a799c7f90d07281d28fcf191c8fe

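The rule described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the actual nova code: the `Node` class and `reserved_units` function are hypothetical names, and it assumes a node exposes one unit of a custom resource class and that "unavailable" means either in maintenance or in a provision state other than `available`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical, trimmed-down view of an ironic node."""
    provision_state: str
    maintenance: bool = False
    instance_uuid: Optional[str] = None


def reserved_units(node: Node, total: int = 1) -> int:
    """Return how many units of the node's inventory to report as reserved.

    Assumption: a node is reserved when it is unavailable (the earlier
    behaviour) or when an instance is assigned to it (the behaviour this
    patch describes), so it stays reserved after the allocation is removed
    until a later update sees it as available again.
    """
    unavailable = node.maintenance or node.provision_state != "available"
    in_use = node.instance_uuid is not None
    return total if (unavailable or in_use) else 0
```

With this rule a node that has just lost its instance and is in `cleaning` is still reported as reserved, so it is not offered as an allocation candidate in the gap between the allocation being released and the clean finishing.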

** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1974070

Title:
  Ironic builds fail when landing on a cleaning node, it doesn't try to
  reschedule

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In a happy world, placement reserved gets updated when a node is not
  available any more, so the scheduler doesn't pick that one, and everyone
  is happy.

  However, as is fairly well known, it takes a while for Nova to notice
  if a node has been marked as in maintenance or if it has started
  cleaning because its instance has been deleted, so the scheduler can
  still land a build on a node in a bad state.

  This actually fails hard when setting the instance uuid, as expected here:
  https://github.com/openstack/nova/blob/4939318649650b60dd07d161b80909e70d0e093e/nova/virt/ironic/driver.py#L378

  You get conflict errors, as the ironic node is in a transitioning
  state (i.e. it's not actually available any more).

  When people are busy rebuilding large numbers of nodes, they tend to
  hit this problem. Even when you only build when you know there are
  available nodes, you sometimes pick the ones you just deleted.

  In an ideal world this would trigger a re-schedule, a bit like when you
  hit errors in the resource tracker such as ComputeResourcesUnavailable.
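
  The re-schedule behaviour asked for here could look something like the sketch below. It is only an illustration of the idea under stated assumptions, not nova's scheduler or driver code: `NodeConflict`, `NoValidNode`, `claim_node` and `build_with_reschedule` are hypothetical names, and nodes are modelled as plain dictionaries.

```python
from typing import Dict, List


class NodeConflict(Exception):
    """Hypothetical: the node refused the instance (e.g. it is cleaning)."""


class NoValidNode(Exception):
    """Hypothetical: every candidate was tried and none accepted the instance."""


def claim_node(node: Dict, instance_uuid: str) -> None:
    """Stand-in for setting the instance uuid on an ironic node."""
    if node["provision_state"] != "available" or node.get("instance_uuid"):
        raise NodeConflict(node["name"])
    node["instance_uuid"] = instance_uuid


def build_with_reschedule(candidates: List[Dict], instance_uuid: str) -> str:
    """Try each candidate in turn instead of failing on the first conflict."""
    for node in candidates:
        try:
            claim_node(node, instance_uuid)
        except NodeConflict:
            # The node was in a bad state; move on to the next candidate.
            continue
        return node["name"]
    raise NoValidNode(instance_uuid)
```

  The point of the sketch is only that a conflict on one node is treated as a reason to try another candidate, rather than as a hard build failure.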

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1974070/+subscriptions


