yahoo-eng-team team mailing list archive

[Bug 1853009] Re: Ironic node rebalance race can lead to missing compute nodes in DB

 

** Also affects: nova/stein
   Importance: Undecided
       Status: New

** Also affects: nova/ocata
   Importance: Undecided
       Status: New

** Also affects: nova/pike
   Importance: Undecided
       Status: New

** Also affects: nova/ussuri
   Importance: High
     Assignee: Mark Goddard (mgoddard)
       Status: In Progress

** Also affects: nova/rocky
   Importance: Undecided
       Status: New

** Also affects: nova/train
   Importance: Undecided
       Status: New

** Also affects: nova/queens
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1853009

Title:
  Ironic node rebalance race can lead to missing compute nodes in DB

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) ocata series:
  New
Status in OpenStack Compute (nova) pike series:
  New
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  New
Status in OpenStack Compute (nova) stein series:
  New
Status in OpenStack Compute (nova) train series:
  New
Status in OpenStack Compute (nova) ussuri series:
  In Progress

Bug description:
  There is a race condition in nova-compute with the ironic virt driver
  that occurs when nodes are rebalanced between compute services. It can
  cause compute node records to be removed from the DB and not
  repopulated. Ultimately, instances cannot be scheduled to the affected
  nodes.

  Steps to reproduce
  ==================

  * Deploy nova with multiple nova-compute services managing ironic.
  * Create some bare metal nodes in ironic, and make them 'available' (the issue does not reproduce if they are 'active')
  * Stop all nova-compute services
  * Wait for all nova-compute services to be DOWN in 'openstack compute service list'
  * Simultaneously start all nova-compute services

  Expected results
  ================

  All ironic nodes appear as hypervisors in 'openstack hypervisor list'

  Actual results
  ==============

  One or more nodes may be missing from 'openstack hypervisor list'.
  This is most easily checked via 'openstack hypervisor list | wc -l'

  Environment
  ===========

  OS: CentOS 7.6
  Hypervisor: ironic
  Nova: 18.2.0, plus a handful of backported patches

  Logs
  ====

  I grabbed some relevant logs from one incident of this issue. They are
  split between two compute services, and I have tried to make that
  clear, including a summary of what happened at each point.

  http://paste.openstack.org/show/786272/

  tl;dr

  
  c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_compute_node). Shows traceback with SQL rollback but seems to succeed
  c1: 19:14:56 Finds no compute record in RT, ‘moves’ existing node from c3
  c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
  c3: 19:15:54 Finds no compute record in RT, ‘moves’ existing node from c1
  c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
  c3: 19:16:56 Creates resource provider
  c3: 19:17:56 Uses existing resource provider

  There are two major problems here:

  * c1 deletes the orphan node after c3 has taken ownership of it

  * c3 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned, so c3 never notices that the DB record is gone and
  never recreates it.
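
The sequence above can be replayed with a toy model. This is a minimal sketch and not Nova's actual code: ToyDB, ToyTracker, simulate_race and the node name "node-1" are hypothetical names invented for illustration, and only the two behaviours described above are modelled (an orphan delete that uses a stale query result, and a local cache that is trusted once populated).

```python
from dataclasses import dataclass, field


@dataclass
class ToyDB:
    """In-memory stand-in for the compute node table: node name -> owning host."""
    owner: dict = field(default_factory=dict)

    def nodes_for_host(self, host):
        return [node for node, h in self.owner.items() if h == host]


@dataclass
class ToyTracker:
    """Hypothetical, heavily simplified resource tracker for one compute service."""
    host: str
    db: ToyDB
    compute_nodes: set = field(default_factory=set)  # local cache, like rt.compute_nodes

    def init_compute_node(self, node):
        if node in self.compute_nodes:
            return  # cached: assumes the DB record still exists
        self.db.owner[node] = self.host  # create the record, or 'move' it to this host
        self.compute_nodes.add(node)

    def delete_orphans(self, db_snapshot, driver_nodes):
        # db_snapshot was read earlier; the current owner is not re-checked.
        for node in db_snapshot:
            if node not in driver_nodes:
                self.db.owner.pop(node, None)
                self.compute_nodes.discard(node)


def simulate_race():
    db = ToyDB()
    c1 = ToyTracker("c1", db)
    c3 = ToyTracker("c3", db)

    c1.init_compute_node("node-1")      # 19:14:56: c1 owns the record
    snapshot = db.nodes_for_host("c1")  # 19:15:54: c1 queries its nodes
    c3.init_compute_node("node-1")      # 19:15:54: c3 'moves' the node to itself
    c1.delete_orphans(snapshot, set())  # 19:15:55: c1 deletes it as an orphan
    c3.init_compute_node("node-1")      # later periodic: cached, so not recreated
    return db.owner, c3.compute_nodes


if __name__ == "__main__":
    owner, c3_cache = simulate_race()
    print("DB records:", owner)
    print("c3 cache:", c3_cache)
```

In this model the record ends up missing from the toy DB while c3 still holds the node in its cache, which mirrors the two problems listed above. Re-checking the current owner at delete time, or having c3 verify that the record still exists, would each break the sequence in the toy model; whether either is appropriate for Nova itself is not something this sketch establishes.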

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1853009/+subscriptions

