yahoo-eng-team team mailing list archive

[Bug 1853009] Re: Ironic node rebalance race can lead to missing compute nodes in DB

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1853009@xxxxxxxxxxxxxxxxxx>
Date: Mon, 30 Aug 2021 17:15:48 -0000
Reply-to: Bug 1853009 <1853009@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Reviewed:  https://review.opendev.org/c/openstack/nova/+/694802
Committed: https://opendev.org/openstack/nova/commit/a8492e88783b40f6dc61888fada232f0d00d6acf
Submitter: "Zuul (22348)"
Branch:    master

commit a8492e88783b40f6dc61888fada232f0d00d6acf
Author: Mark Goddard <mark@xxxxxxxxxxxx>
Date:   Mon Nov 18 12:06:47 2019 +0000

    Prevent deletion of a compute node belonging to another host
    
    There is a race condition in nova-compute with the ironic virt driver as
    nodes get rebalanced. It can lead to compute nodes being removed in the
    DB and not repopulated. Ultimately this prevents these nodes from being
    scheduled to.
    
    The main race condition involved is in update_available_resource in
    the compute manager. When the list of compute nodes is queried, there is
    a compute node belonging to the host that it does not expect to be
    managing, i.e. it is an orphan. Between that time and deleting the
    orphan, the real owner of the compute node takes ownership of it (in
    the resource tracker). However, the node is still deleted as the first
    host is unaware of the ownership change.
    
    This change prevents this from occurring by filtering on the host when
    deleting a compute node. If another compute host has taken ownership of
    a node, it will have updated the host field and this will prevent
    deletion from occurring. The first host sees this has happened via the
    ComputeHostNotFound exception, and avoids deleting its resource
    provider.
    
    Co-Authored-By: melanie witt <melwittt@xxxxxxxxx>
    
    Closes-Bug: #1853009
    Related-Bug: #1841481
    
    Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
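
The host-filtered delete described in the commit message can be sketched as
follows. This is a minimal illustration and not nova's actual code: the
compute_nodes table layout, the delete_compute_node helper and the local
ComputeHostNotFound class are stand-ins assumed for the example, and sqlite3
is used only to keep it self-contained.

```python
import sqlite3


class ComputeHostNotFound(Exception):
    """Hypothetical stand-in for the exception named in the commit message."""


def delete_compute_node(conn, node_id, expected_host):
    """Delete the node only if it is still owned by expected_host."""
    cur = conn.execute(
        "DELETE FROM compute_nodes WHERE id = ? AND host = ?",
        (node_id, expected_host),
    )
    conn.commit()
    if cur.rowcount == 0:
        # Either the row is already gone or another host now owns it.
        raise ComputeHostNotFound(expected_host)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, host TEXT)"
    )
    conn.execute("INSERT INTO compute_nodes (id, host) VALUES (1, 'c1')")
    # c3 takes ownership before c1 gets around to deleting its "orphan".
    conn.execute("UPDATE compute_nodes SET host = 'c3' WHERE id = 1")
    try:
        delete_compute_node(conn, 1, "c1")
        deleted = True
    except ComputeHostNotFound:
        # In the real fix, c1 would also skip deleting the resource provider.
        deleted = False
    remaining = conn.execute("SELECT host FROM compute_nodes").fetchall()
    return deleted, remaining
```

Because the ownership check and the delete happen in a single statement,
there is no window between "check the host" and "delete the row" in which
another service could take the node over.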


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1853009

Title:
  Ironic node rebalance race can lead to missing compute nodes in DB

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ocata series:
  New
Status in OpenStack Compute (nova) pike series:
  New
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  New
Status in OpenStack Compute (nova) stein series:
  New
Status in OpenStack Compute (nova) train series:
  New
Status in OpenStack Compute (nova) ussuri series:
  In Progress

Bug description:
  There is a race condition in nova-compute with the ironic virt driver
  as nodes get rebalanced. It can lead to compute nodes being removed in
  the DB and not repopulated. Ultimately this prevents these nodes from
  being scheduled to.

  Steps to reproduce
  ==================

  * Deploy nova with multiple nova-compute services managing ironic.
  * Create some bare metal nodes in ironic, and make them 'available' (does not work if they are 'active')
  * Stop all nova-compute services
  * Wait for all nova-compute services to be DOWN in 'openstack compute service list'
  * Simultaneously start all nova-compute services

  Expected results
  ================

  All ironic nodes appear as hypervisors in 'openstack hypervisor list'

  Actual results
  ==============

  One or more nodes may be missing from 'openstack hypervisor list'.
  This is most easily checked via 'openstack hypervisor list | wc -l'

  Environment
  ===========

  OS: CentOS 7.6
  Hypervisor: ironic
  Nova: 18.2.0, plus a handful of backported patches

  Logs
  ====

  I grabbed some relevant logs from one incident of this issue. They are
  split between two compute services, and I have tried to make that
  clear, including a summary of what happened at each point.

  http://paste.openstack.org/show/786272/

  tl;dr

  
  c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_compute_node). Shows traceback with SQL rollback but seems to succeed
  c1: 19:14:56 Finds no compute record in RT, ‘moves’ existing node from c3
  c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
  c3: 19:15:54 Finds no compute record in RT, ‘moves’ existing node from c1
  c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
  c3: 19:16:56 Creates resource provider
  c3: 19:17:56 Uses existing resource provider

  There are two major problems here:

  * c1 deletes the orphan node after c3 has taken ownership of it

  * c3 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned.
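
The interleaving in the tl;dr can be replayed deterministically with a toy
model. This is only a sketch under stated assumptions: the "DB" is a plain
dict mapping a node name to its owning host, node-1 and replay are
hypothetical names, and c1 is assumed to treat the node as an orphan because
its driver no longer reports it after the rebalance.

```python
def replay(guard_on_host):
    """Replay the 19:15:54-19:15:55 interleaving from the logs.

    guard_on_host=False models the buggy behaviour (unconditional delete);
    guard_on_host=True models a delete that re-checks the owning host.
    """
    # State after c1 'moved' the node to itself at 19:14:56.
    db = {"node-1": "c1"}

    # 19:15:54 c1 begins its periodic update and lists nodes for its host.
    seen_by_c1 = [node for node, host in db.items() if host == "c1"]

    # 19:15:54 c3 finds no record in its RT and 'moves' the node from c1.
    db["node-1"] = "c3"

    # 19:15:55 c1 deletes what it believes are its orphans, using the
    # stale list it read before c3 took ownership.
    for node in seen_by_c1:
        if guard_on_host and db.get(node) != "c1":
            continue  # ownership changed, so leave the record alone
        db.pop(node, None)

    return db
```

With guard_on_host=False the dict ends up empty, which corresponds to the
node going missing from 'openstack hypervisor list'. With guard_on_host=True
the record survives and still belongs to c3.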

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1853009/+subscriptions
References

[Bug 1853009] [NEW] Ironic node rebalance race can lead to missing compute nodes in DB
From: Mark Goddard, 2019-11-18