
yahoo-eng-team team mailing list archive

[Bug 1714248] [NEW] Compute node HA for ironic doesn't work due to the name duplication of Resource Provider

 

Public bug reported:

Description
===========
In an environment with multiple compute nodes using the ironic driver,
when a compute node goes down, another compute node cannot take over its ironic nodes.

Steps to reproduce
==================
1. Start multiple compute nodes with the ironic driver.
2. Register one node to ironic.
3. Stop the compute node which manages the ironic node.
4. Create an instance.

Expected result
===============
The instance is created.

Actual result
=============
Instance creation fails.

Environment
===========
1. Exact version of OpenStack you are running.
openstack-nova-scheduler-15.0.6-2.el7.noarch
openstack-nova-console-15.0.6-2.el7.noarch
python2-novaclient-7.1.0-1.el7.noarch
openstack-nova-common-15.0.6-2.el7.noarch
openstack-nova-serialproxy-15.0.6-2.el7.noarch
openstack-nova-placement-api-15.0.6-2.el7.noarch
python-nova-15.0.6-2.el7.noarch
openstack-nova-novncproxy-15.0.6-2.el7.noarch
openstack-nova-api-15.0.6-2.el7.noarch
openstack-nova-conductor-15.0.6-2.el7.noarch

2. Which hypervisor did you use?
ironic

Details
=======
When a nova-compute service goes down, another nova-compute takes over the ironic nodes managed by the failed service by re-balancing the hash ring. The now-active nova-compute then tries to create a new resource provider with a new ComputeNode object UUID and the hypervisor name (the ironic node name) [1][2][3]. This creation fails with a conflict (409) because a resource provider with the same name, created by the failed nova-compute, already exists.
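
The conflict can be illustrated with a toy sketch (the class and method names below are illustrative assumptions, not nova or placement code): placement enforces a unique constraint on the provider name, so a second create with a different UUID but the same name is rejected.

```python
# Toy in-memory stand-in for the placement service's resource-provider
# name uniqueness. All names here are hypothetical, for illustration only.

class ConflictError(Exception):
    """Stands in for placement's HTTP 409 Conflict response."""

class FakePlacement:
    def __init__(self):
        self._by_uuid = {}   # provider uuid -> name
        self._by_name = {}   # provider name -> uuid

    def create_resource_provider(self, uuid, name):
        # Placement enforces a unique constraint on the provider name.
        owner = self._by_name.get(name)
        if owner is not None and owner != uuid:
            raise ConflictError(
                "409: provider name %r already owned by %s" % (name, owner))
        self._by_uuid[uuid] = name
        self._by_name[name] = uuid

placement = FakePlacement()

# compute01 creates the provider for ironic node "node-1".
placement.create_resource_provider("uuid-compute01-node1", "node-1")

# compute01 dies; compute02 takes over "node-1" via the hash ring and
# tries to create a provider with a *new* ComputeNode UUID, same name.
try:
    placement.create_resource_provider("uuid-compute02-node1", "node-1")
except ConflictError as exc:
    print(exc)  # the 409 conflict described in this report
```

Because the create is keyed on the new ComputeNode UUID while the name is unchanged, the takeover can never register its own provider.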

When a new instance is requested, the scheduler gets only the old
resource provider for the ironic node [4]. The ironic node is therefore
not selected:

WARNING nova.scheduler.filters.compute_filter [req-a37d68b5-7ab1-4254-8698-502304607a90 7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -] (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while
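
The warning comes from the ComputeFilter rejecting hosts whose service heartbeat is stale. A simplified sketch of that check (an assumed simplification, not nova's actual implementation; the 60-second threshold mirrors nova's configurable `service_down_time` default):

```python
# Hypothetical simplification of the ComputeFilter's "service up" check.
import datetime

SERVICE_DOWN_TIME = 60  # seconds; stands in for nova's service_down_time

def service_is_up(last_heartbeat, now):
    """A service counts as 'up' if its heartbeat is recent enough."""
    return (now - last_heartbeat).total_seconds() <= SERVICE_DOWN_TIME

def host_passes(host_name, last_heartbeat, now):
    """Return True if the host may receive instances, else log and reject."""
    if not service_is_up(last_heartbeat, now):
        print("WARNING (%s) has not been heard from in a while" % host_name)
        return False
    return True

now = datetime.datetime(2017, 9, 1, 12, 0, 0)
stale = now - datetime.timedelta(minutes=5)  # failed compute's last heartbeat
print(host_passes("compute02", stale, now))  # False: host filtered out
```

Since the only resource provider for the ironic node still belongs to the down service, every candidate fails this check and scheduling ends with no valid host.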

[1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
[2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
[3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
[4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1714248


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1714248/+subscriptions

