yahoo-eng-team team mailing list archive
Message #69866
[Bug 1714248] Re: Compute node HA for ironic doesn't work due to the name duplication of Resource Provider
Reviewed: https://review.openstack.org/508555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Submitter: Zuul
Branch: master
commit e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Author: John Garbutt <john@xxxxxxxxxxxxxxx>
Date: Fri Sep 29 15:48:54 2017 +0100
Re-use existing ComputeNode on ironic rebalance
When one of several ironic-based nova-compute services dies, a node
rebalance occurs to ensure that an active nova-compute service is still
handling requests for the instances running on the affected nodes.
Today, when this occurs, we create a new ComputeNode entry. This change
alters that logic to detect the case of the ironic node rebalance and in
that case we re-use the existing ComputeNode entry, simply updating the
host field to match the new host it has been rebalanced onto.
Previously we hit problems with placement when we get a new
ComputeNode.uuid for the same ironic_node.uuid. This reusing of the
existing entry keeps the ComputeNode.uuid the same when the rebalance of
the ComputeNode occurs.
Without keeping the same ComputeNode.uuid placement errors out with a 409
because we attempt to create a ResourceProvider that has the same name
as an existing ResourceProvider. Had that worked, we would have noticed
the race that occurs after we create the ResourceProvider but before we
add back the existing allocations for existing instances. Keeping the
ComputeNode.uuid the same means we simply look up the existing
ResourceProvider in placement, avoiding all this pain and tears.
Closes-Bug: #1714248
Co-Authored-By: Dmitry Tantsur <dtantsur@xxxxxxxxxx>
Change-Id: I4253cffca3dbf558c875eed7e77711a31e9e3406
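
The re-use logic described in the commit message above can be illustrated
with a minimal sketch. This is not the nova implementation: ComputeNodeStore
and get_or_rebalance are hypothetical names, and the in-memory list only
stands in for the compute_nodes table. The sketch assumes that an ironic
rebalance can be recognised by finding exactly one existing entry with the
same node name on a different host.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ComputeNode:
    host: str                  # nova-compute service that currently owns the node
    hypervisor_hostname: str   # for ironic, this is the ironic node UUID
    uuid: str = field(default_factory=lambda: str(uuid4()))


class ComputeNodeStore:
    """Hypothetical in-memory stand-in for the compute_nodes table."""

    def __init__(self):
        self._nodes = []

    def find_by_nodename(self, nodename):
        return [n for n in self._nodes if n.hypervisor_hostname == nodename]

    def get_or_rebalance(self, host, nodename):
        # Normal case: this host already owns an entry for the node.
        for node in self._nodes:
            if node.host == host and node.hypervisor_hostname == nodename:
                return node
        # Rebalance case (assumption): exactly one entry exists for the node
        # on another host. Re-use it and only update the host field, so the
        # ComputeNode.uuid stays the same.
        existing = self.find_by_nodename(nodename)
        if len(existing) == 1:
            existing[0].host = host
            return existing[0]
        # Otherwise this is a genuinely new node: create a fresh entry.
        node = ComputeNode(host=host, hypervisor_hostname=nodename)
        self._nodes.append(node)
        return node
```

Calling get_or_rebalance("compute01", node) and then
get_or_rebalance("compute02", node) returns the same object both times, with
an unchanged uuid and the host updated to "compute02".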
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1714248
Title:
Compute node HA for ironic doesn't work due to the name duplication of
Resource Provider
Status in Ironic:
Invalid
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) pike series:
In Progress
Bug description:
Description
===========
In an environment with multiple compute nodes using the ironic driver,
when one compute node goes down, another compute node cannot take over its ironic nodes.
Steps to reproduce
==================
1. Start multiple compute nodes with the ironic driver.
2. Register one node with ironic.
3. Stop the compute node that manages the ironic node.
4. Create an instance.
Expected result
===============
The instance is created.
Actual result
=============
The instance creation fails.
Environment
===========
1. Exact version of OpenStack you are running.
openstack-nova-scheduler-15.0.6-2.el7.noarch
openstack-nova-console-15.0.6-2.el7.noarch
python2-novaclient-7.1.0-1.el7.noarch
openstack-nova-common-15.0.6-2.el7.noarch
openstack-nova-serialproxy-15.0.6-2.el7.noarch
openstack-nova-placement-api-15.0.6-2.el7.noarch
python-nova-15.0.6-2.el7.noarch
openstack-nova-novncproxy-15.0.6-2.el7.noarch
openstack-nova-api-15.0.6-2.el7.noarch
openstack-nova-conductor-15.0.6-2.el7.noarch
2. Which hypervisor did you use?
ironic
Details
=======
When a nova-compute goes down, another nova-compute takes over the ironic nodes managed by the failed nova-compute by rebalancing the hash ring. The active nova-compute then tries to create a
new resource provider with a new ComputeNode object UUID and the hypervisor name (ironic node UUID)[1][2][3]. This creation fails with a conflict (409) because a resource provider with the same name, created by the failed nova-compute, already exists.
When a new instance is requested, the scheduler gets only an old
resource provider for the ironic node[4]. Then, the ironic node is not
selected:
WARNING nova.scheduler.filters.compute_filter [req-a37d68b5-7ab1-4254-8698-502304607a90 7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -] (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while
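
The 409 conflict described in the details above can be reproduced with a toy
model. This is only a sketch: ToyPlacement, Conflict and the method names are
hypothetical and do not reflect the placement API. The model assumes only
what the report states, namely that a resource provider cannot be created
with a name that already belongs to another provider.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response from placement."""


class ToyPlacement:
    """Hypothetical model: resource providers keyed by UUID, names unique."""

    def __init__(self):
        self._name_by_uuid = {}

    def create_resource_provider(self, rp_uuid, name):
        if rp_uuid in self._name_by_uuid:
            raise Conflict("uuid %s already exists" % rp_uuid)
        if name in self._name_by_uuid.values():
            raise Conflict("name %s already exists" % name)
        self._name_by_uuid[rp_uuid] = name

    def ensure_resource_provider(self, rp_uuid, name):
        # Look the provider up by UUID first and create it only if missing.
        if rp_uuid not in self._name_by_uuid:
            self.create_resource_provider(rp_uuid, name)
        return rp_uuid


placement = ToyPlacement()
ironic_node = "ironic-node-uuid"  # hypervisor name used as the provider name

# The failed nova-compute registered the provider under its ComputeNode UUID.
placement.ensure_resource_provider("old-compute-node-uuid", ironic_node)

# After the rebalance, a new ComputeNode UUID with the same name conflicts.
try:
    placement.ensure_resource_provider("new-compute-node-uuid", ironic_node)
    conflicted = False
except Conflict:
    conflicted = True

# Keeping the old ComputeNode UUID simply finds the existing provider.
reused = placement.ensure_resource_provider("old-compute-node-uuid", ironic_node)
```

In this model the second call raises Conflict, while the third call succeeds
because the lookup by the unchanged UUID finds the existing provider.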
[1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
[2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
[3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
[4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183