yahoo-eng-team team mailing list archive
Message #87024
[Bug 1853009] Re: Ironic node rebalance race can lead to missing compute nodes in DB
Reviewed: https://review.opendev.org/c/openstack/nova/+/694802
Committed: https://opendev.org/openstack/nova/commit/a8492e88783b40f6dc61888fada232f0d00d6acf
Submitter: "Zuul (22348)"
Branch: master
commit a8492e88783b40f6dc61888fada232f0d00d6acf
Author: Mark Goddard <mark@xxxxxxxxxxxx>
Date: Mon Nov 18 12:06:47 2019 +0000
Prevent deletion of a compute node belonging to another host
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.
The main race condition is in update_available_resources in the compute
manager. When the list of compute nodes is queried, there may be a
compute node belonging to the host that it does not expect to be
managing, i.e. an orphan. Between that query and the deletion of the
orphan, the real owner of the compute node takes ownership of it (in
the resource tracker). However, the node is still deleted, because the
first host is unaware of the ownership change.
This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.
Co-Authored-By: melanie witt <melwittt@xxxxxxxxx>
Closes-Bug: #1853009
Related-Bug: #1841481
Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
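The core idea of the fix can be sketched in a few lines of Python. This is an
illustration, not the actual nova code: destroy_compute_node, delete_orphan,
and the toy list-of-dicts "DB" are hypothetical stand-ins, but the ownership
check mirrors what the commit describes, a delete filtered on host that raises
ComputeHostNotFound when another host has claimed the node.

```python
# Minimal sketch (not nova's real DB API) of a host-filtered delete.
# The "database" is a list of {"id": ..., "host": ...} dicts.

class ComputeHostNotFound(Exception):
    """Raised when no compute node row matches the (id, host) pair."""

def destroy_compute_node(db, node_id, host):
    """Delete a compute node only if it still belongs to `host`."""
    before = len(db)
    db[:] = [row for row in db
             if not (row["id"] == node_id and row["host"] == host)]
    if len(db) == before:
        # Zero rows matched: another host has updated the host field.
        raise ComputeHostNotFound(f"node {node_id} not owned by {host}")

def delete_orphan(db, node_id, host):
    """What the periodic task would do for an apparent orphan."""
    try:
        destroy_compute_node(db, node_id, host)
        return "deleted"   # safe to also delete the resource provider
    except ComputeHostNotFound:
        return "skipped"   # ownership changed; leave node and provider alone
```

With this check, a host whose view of its orphans is stale simply skips the
deletion instead of destroying a record that now belongs to another host.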
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1853009
Title:
Ironic node rebalance race can lead to missing compute nodes in DB
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) ocata series:
New
Status in OpenStack Compute (nova) pike series:
New
Status in OpenStack Compute (nova) queens series:
New
Status in OpenStack Compute (nova) rocky series:
New
Status in OpenStack Compute (nova) stein series:
New
Status in OpenStack Compute (nova) train series:
New
Status in OpenStack Compute (nova) ussuri series:
In Progress
Bug description:
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.
Steps to reproduce
==================
* Deploy nova with multiple nova-compute services managing ironic.
* Create some bare metal nodes in ironic, and make them 'available' (does not work if they are 'active')
* Stop all nova-compute services
* Wait for all nova-compute services to be DOWN in 'openstack compute service list'
* Simultaneously start all nova-compute services
Expected results
================
All ironic nodes appear as hypervisors in 'openstack hypervisor list'
Actual results
==============
One or more nodes may be missing from 'openstack hypervisor list'.
This is most easily checked via 'openstack hypervisor list | wc -l'
Environment
===========
OS: CentOS 7.6
Hypervisor: ironic
Nova: 18.2.0, plus a handful of backported patches
Logs
====
I grabbed some relevant logs from one incident of this issue. They are
split between two compute services, and I have tried to make that
clear, including a summary of what happened at each point.
http://paste.openstack.org/show/786272/
tl;dr
c3: 19:14:55 Finds no compute record in RT. Tries to create one (_init_compute_node). Shows traceback with SQL rollback but seems to succeed
c1: 19:14:56 Finds no compute record in RT, ‘moves’ existing node from c3
c1: 19:15:54 Begins periodic update, queries compute nodes for this host, finds the node
c3: 19:15:54 Finds no compute record in RT, ‘moves’ existing node from c1
c1: 19:15:55 Deletes orphan compute node (which now belongs to c3)
c3: 19:16:56 Creates resource provider
c3: 19:17:56 Uses existing resource provider
There are two major problems here:
* c1 deletes the orphan node after c3 has taken ownership of it
* c3 assumes that another compute service will not delete its nodes:
once a node is in rt.compute_nodes, it is not removed again unless it
becomes orphaned
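The interleaving in the tl;dr above can be replayed with a toy in-memory
"database" (purely illustrative; node id 42 and the list-of-dicts table are
stand-ins). It shows why an unfiltered delete based on a stale snapshot
removes a node that c3 has since reclaimed:

```python
# Hypothetical replay of the race; a list of dicts stands in for the
# compute_nodes table, and timestamps refer to the log summary above.

db = [{"id": 42, "host": "c3"}]     # 19:14:55  c3 creates the node

db[0]["host"] = "c1"                # 19:14:56  c1's RT "moves" it to c1

# 19:15:54  c1's periodic update snapshots the nodes it thinks it owns
c1_view = [row["id"] for row in db if row["host"] == "c1"]

db[0]["host"] = "c3"                # 19:15:54  c3 "moves" it back

# 19:15:55  c1 deletes its "orphan" using the stale snapshot, with no
# host filter, so c3's record is destroyed
db[:] = [row for row in db if row["id"] not in c1_view]

print(db)  # prints []
```

With the host-filtered delete described in the commit message, the last step
would match zero rows and raise ComputeHostNotFound instead, leaving c3's
record (and its resource provider) intact.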
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1853009/+subscriptions