yahoo-eng-team team mailing list archive

[Bug 1714248] Re: Compute node HA for ironic doesn't work due to the name duplication of Resource Provider

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1714248@xxxxxxxxxxxxxxxxxx>
Date: Wed, 13 Dec 2017 13:41:19 -0000
Reply-to: Bug 1714248 <1714248@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Reviewed:  https://review.openstack.org/508555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Submitter: Zuul
Branch:    master

commit e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Author: John Garbutt <john@xxxxxxxxxxxxxxx>
Date:   Fri Sep 29 15:48:54 2017 +0100

    Re-use existing ComputeNode on ironic rebalance
    
    When one of several ironic-based nova-compute services dies, a node
    rebalance occurs to ensure that an active nova-compute service is
    still handling requests for the given running instance.
    
    Today, when this occurs, we create a new ComputeNode entry. This change
    alters that logic to detect the case of the ironic node rebalance and in
    that case we re-use the existing ComputeNode entry, simply updating the
    host field to match the new host it has been rebalanced onto.
    
    Previously we hit problems with placement when we get a new
    ComputeNode.uuid for the same ironic_node.uuid. This reusing of the
    existing entry keeps the ComputeNode.uuid the same when the rebalance of
    the ComputeNode occurs.
    
    Without keeping the same ComputeNode.uuid, placement errors out with a
    409 because we attempt to create a ResourceProvider that has the same
    name as an existing ResourceProvider. Had that worked, we would have
    noticed the race that occurs after we create the ResourceProvider but
    before we add back the existing allocations for existing instances.
    Keeping the ComputeNode.uuid the same means we simply look up the
    existing ResourceProvider in placement, avoiding all this pain and tears.
    
    Closes-Bug: #1714248
    Co-Authored-By: Dmitry Tantsur <dtantsur@xxxxxxxxxx>
    Change-Id: I4253cffca3dbf558c875eed7e77711a31e9e3406
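
The re-use logic described in the commit message can be sketched roughly
as follows. This is a minimal in-memory illustration, not the actual nova
code: ComputeNodeStore, init_compute_node and the field layout are
hypothetical names chosen for the example, and the real change works
against nova's database objects.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ComputeNode:
    host: str                 # nova-compute service that owns the node
    hypervisor_hostname: str  # for ironic, the ironic node UUID
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


class ComputeNodeStore:
    """Hypothetical in-memory stand-in for the compute_nodes table."""

    def __init__(self) -> None:
        self._by_nodename: Dict[str, ComputeNode] = {}

    def get_by_nodename(self, nodename: str) -> Optional[ComputeNode]:
        return self._by_nodename.get(nodename)

    def save(self, node: ComputeNode) -> None:
        self._by_nodename[node.hypervisor_hostname] = node


def init_compute_node(store: ComputeNodeStore, host: str,
                      nodename: str) -> ComputeNode:
    """Return the ComputeNode for nodename, re-using an existing entry.

    If another host already created an entry for this node (the rebalance
    case), only the host field is updated, so the uuid stays the same.
    """
    existing = store.get_by_nodename(nodename)
    if existing is not None:
        if existing.host != host:
            existing.host = host  # rebalanced onto a new nova-compute
            store.save(existing)
        return existing
    node = ComputeNode(host=host, hypervisor_hostname=nodename)
    store.save(node)
    return node
```

Calling init_compute_node first for "compute01" and then for "compute02"
with the same node name returns an entry with an unchanged uuid, which is
the property that lets the existing ResourceProvider be looked up rather
than recreated.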


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1714248

Title:
  Compute node HA for ironic doesn't work due to the name duplication of
  Resource Provider

Status in Ironic:
  Invalid
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  In Progress

Bug description:
  Description
  ===========
  In an environment with multiple compute nodes using the ironic driver,
  when one compute node goes down, another compute node cannot take over
  its ironic nodes.

  Steps to reproduce
  ==================
  1. Start multiple compute nodes with ironic driver.
  2. Register one node to ironic.
  3. Stop a compute node which manages the ironic node.
  4. Create an instance.

  Expected result
  ===============
  The instance is created.

  Actual result
  =============
  The instance creation fails.

  Environment
  ===========
  1. Exact version of OpenStack you are running.
  openstack-nova-scheduler-15.0.6-2.el7.noarch
  openstack-nova-console-15.0.6-2.el7.noarch
  python2-novaclient-7.1.0-1.el7.noarch
  openstack-nova-common-15.0.6-2.el7.noarch
  openstack-nova-serialproxy-15.0.6-2.el7.noarch
  openstack-nova-placement-api-15.0.6-2.el7.noarch
  python-nova-15.0.6-2.el7.noarch
  openstack-nova-novncproxy-15.0.6-2.el7.noarch
  openstack-nova-api-15.0.6-2.el7.noarch
  openstack-nova-conductor-15.0.6-2.el7.noarch

  2. Which hypervisor did you use?
  ironic

  Details
  =======
  When a nova-compute goes down, another nova-compute takes over the
  ironic nodes managed by the failed nova-compute by re-balancing a hash
  ring. The active nova-compute then tries to create a new resource
  provider with a new ComputeNode object UUID and the hypervisor name
  (ironic node UUID)[1][2][3]. This creation fails with a conflict (409)
  because a resource provider with the same name, created by the failed
  nova-compute, already exists.

  When a new instance is requested, the scheduler gets only an old
  resource provider for the ironic node[4]. Then, the ironic node is not
  selected:

  WARNING nova.scheduler.filters.compute_filter [req-a37d68b5-7ab1-4254-8698-502304607a90 7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -] (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while

  [1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
  [2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
  [3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
  [4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183
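
The conflict described in the details above can be reproduced with a toy
model. This is a sketch under stated assumptions, not the placement API:
ToyPlacement, Conflict and takeover are hypothetical names, and the only
behaviour modelled is that a resource provider name must be unique, as
the 409 in this report implies.

```python
class Conflict(Exception):
    """Raised where the real placement service would answer HTTP 409."""
    status_code = 409


class ToyPlacement:
    """Hypothetical in-memory model: provider UUIDs and names are unique."""

    def __init__(self):
        self._name_by_uuid = {}

    def create(self, rp_uuid, name):
        if rp_uuid in self._name_by_uuid or name in self._name_by_uuid.values():
            raise Conflict("resource provider %r already exists" % name)
        self._name_by_uuid[rp_uuid] = name

    def ensure(self, rp_uuid, name):
        # Look up by UUID first; create only when the UUID is unknown.
        if rp_uuid not in self._name_by_uuid:
            self.create(rp_uuid, name)
        return rp_uuid


def takeover(placement, compute_node_uuid, ironic_node_uuid):
    """Simulate a nova-compute registering an ironic node in placement."""
    try:
        placement.ensure(compute_node_uuid, ironic_node_uuid)
        return "ok"
    except Conflict:
        return "409"


placement = ToyPlacement()
# The original nova-compute registers the node under its ComputeNode UUID.
print(takeover(placement, "cn-old", "ironic-node-1"))
# After the rebalance, a new ComputeNode UUID with the same name conflicts.
print(takeover(placement, "cn-new", "ironic-node-1"))
# Re-using the original ComputeNode UUID finds the existing provider.
print(takeover(placement, "cn-old", "ironic-node-1"))
```

The second call is the failing case from this report: the provider name
(the ironic node UUID) is already taken by the entry that the failed
nova-compute created.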

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/1714248/+subscriptions
References

[Bug 1714248] [NEW] Compute node HA for ironic doesn't work due to the name duplication of Resource Provider
From: Hironori Shiina, 2017-08-31