yahoo-eng-team team mailing list archive

[Bug 1750450] [NEW] ironic: n-cpu fails to recover after losing connection to ironic-api and placement-api

 

Public bug reported:

The ironic virt driver misbehaves when the ironic API goes down: it
returns [] from get_available_nodes(). When the resource tracker sees
the empty list, it immediately attempts to delete all of the compute
node records and resource providers for those nodes.

If placement is also down at this time, the resource providers will not
be properly deleted.

When ironic-api and placement-api come back, nova sees the nodes again,
creates compute_node records for them, and tries to create new resource
providers (as they are new compute_node records). This fails with a name
conflict because the old resource providers were never deleted, and the
nodes are unusable.
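
The sequence described above can be reproduced with a toy model. This is
only a sketch, not nova code: ToyPlacement, ToyTracker, PlacementDown and
run_outage_scenario are hypothetical names, and the only behavior
borrowed from the report is that provider names must be unique and that
a fresh compute_node record gets a fresh identity.

```python
import uuid


class PlacementDown(Exception):
    """Raised by the toy placement service while it is unreachable."""


class ToyPlacement:
    """Hypothetical stand-in for placement: provider names are unique."""

    def __init__(self):
        self.up = True
        self.providers = {}  # provider name -> provider uuid

    def create(self, name, rp_uuid):
        if not self.up:
            raise PlacementDown()
        if name in self.providers and self.providers[name] != rp_uuid:
            raise ValueError("name conflict for provider %s" % name)
        self.providers[name] = rp_uuid

    def delete(self, name):
        if not self.up:
            raise PlacementDown()
        self.providers.pop(name, None)


class ToyTracker:
    """Hypothetical stand-in for the resource tracker."""

    def __init__(self, placement):
        self.placement = placement
        self.compute_nodes = {}  # node name -> compute node uuid
        self.unusable = set()

    def update(self, available_nodes):
        # Drop records for nodes the driver no longer reports.
        for name in list(self.compute_nodes):
            if name not in available_nodes:
                del self.compute_nodes[name]
                try:
                    self.placement.delete(name)
                except PlacementDown:
                    pass  # the provider is left behind in placement
        # Create records (with a new uuid) for nodes not yet tracked.
        for name in available_nodes:
            if name not in self.compute_nodes:
                new_uuid = str(uuid.uuid4())
                self.compute_nodes[name] = new_uuid
                try:
                    self.placement.create(name, new_uuid)
                except ValueError:
                    self.unusable.add(name)


def run_outage_scenario():
    placement = ToyPlacement()
    tracker = ToyTracker(placement)
    tracker.update(["node-1"])  # normal operation
    placement.up = False        # placement-api goes down
    tracker.update([])          # ironic-api down: driver reports []
    placement.up = True         # both services return
    tracker.update(["node-1"])  # new record collides with the old provider
    return tracker, placement
```

After run_outage_scenario(), "node-1" ends up in tracker.unusable: the
provider created before the outage still holds the name, so the provider
for the new compute node record cannot be created.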

This is easy to fix by raising an exception in get_available_nodes()
instead of lying to the resource tracker and returning []. However, this
causes nova-compute to fail to start if ironic-api is not available.
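
A minimal sketch of that idea, under stated assumptions: this is not the
driver's actual code. IronicUnavailable, the fetch_nodes callable and
update_tracked_nodes are hypothetical names, and the caller that keeps
its previous state on failure is just one possible way to handle the
exception, not something the report proposes.

```python
class IronicUnavailable(Exception):
    """Hypothetical error: the node list could not be fetched."""


def get_available_nodes(fetch_nodes):
    """Return node names, or raise instead of pretending there are none.

    fetch_nodes is a hypothetical callable standing in for the ironic
    API call; it is assumed to raise ConnectionError when the API is down.
    """
    try:
        return list(fetch_nodes())
    except ConnectionError as exc:
        raise IronicUnavailable("ironic API unreachable") from exc


def update_tracked_nodes(tracked, fetch_nodes):
    """Return the new tracked set; keep the old one if ironic is down."""
    try:
        return set(get_available_nodes(fetch_nodes))
    except IronicUnavailable:
        # "No data" is not the same as "no nodes": skip any cleanup.
        return tracked
```

With this shape, an outage leaves the tracked nodes untouched, while a
genuine empty list from a reachable API still clears them. It does not
address what should happen when the very first call at startup fails.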

Failing to start may be fine, but it should have a larger discussion.
We've added these hacks over the years for some reason; we should look
at the bigger picture and decide how we want to handle these cases.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1750450

Title:
  ironic: n-cpu fails to recover after losing connection to ironic-api
  and placement-api

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1750450/+subscriptions

