
yahoo-eng-team team mailing list archive

[Bug 1864122] [NEW] Instances (bare metal) queue for long time when managing a large amount of Ironic nodes

 

Public bug reported:

Description
===========
We have two deployments, one with ~150 bare metal nodes and another with ~300. Each is managed by a single nova-compute process running the Ironic driver. After upgrading from the Ocata release, we noticed that instance launches would be stuck in the spawning state for a long time, from 30 minutes up to an hour in some cases.

After investigation, the root cause appeared to be contention between
the update_resources periodic task and the instance claim step. A single
semaphore, "compute_resources", guards every access within the
resource_tracker. The update_resources job, which runs every minute by
default, updates each hypervisor independently and in series, so it was
constantly queuing up acquisitions of this semaphore. For us, that meant
each Ironic node held the semaphore for the duration of its update
(about 2-5 seconds in practice). Multiplied by 150 nodes, one pass takes
roughly 5 to 12.5 minutes, far longer than the one-minute interval, so
our update task was effectively running constantly. Because an instance
claim also needs this semaphore, instances got stuck in the "Build"
state, after scheduling, for tens of minutes on average. There seemed to
be some probabilistic effect here, which I hypothesize is related to the
locking mechanism not using a "fair" lock (first-come, first-served) by
default.
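
As a rough illustration of the arithmetic behind this, the sketch below uses only the figures quoted in this report (150 nodes, 2-5 seconds per node, a 60-second interval). The function and variable names are made up for the example and are not taken from Nova.

```python
# Back-of-the-envelope estimate; all numbers come from the report above.
PERIOD_S = 60        # interval of the periodic task (one minute by default)
NODES = 150          # size of the smaller deployment
HOLD_S = (2, 5)      # observed per-node semaphore hold time, in seconds


def pass_duration(nodes, hold_s):
    """Total time one serial pass over all nodes spends holding the semaphore."""
    return nodes * hold_s


best = pass_duration(NODES, HOLD_S[0])
worst = pass_duration(NODES, HOLD_S[1])

print(f"one pass holds the semaphore for {best}-{worst} s")
print(f"that is {best / PERIOD_S:.1f}x to {worst / PERIOD_S:.1f}x the task interval")
```

Even in the best case a single pass outlasts the interval several times over, so the next pass is already due before the current one finishes and the semaphore is almost never idle.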

Steps to reproduce
==================
I suspect this is only visible on deployments of roughly 100 or more Ironic nodes, and only when they are all managed by a single nova-compute-ironic service. Because lock acquisition order is non-deterministic, the behavior is sporadic, but launching an instance is enough to observe it.
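
To make the fairness hypothesis concrete, here is a toy model, not a reproduction of Nova's locking. It assumes a claim arrives just as a node update takes the semaphore, and that with an unfair lock the claim wins each subsequent hand-off only with some probability `p_claim_wins`. That parameter, its value of 0.1, and all names here are invented purely for illustration; the real odds depend on how the green threads are scheduled.

```python
import random


def claim_wait(hold_times, fair, rng, p_claim_wins=0.1):
    """Seconds a waiting claim spends blocked behind serial node updates.

    fair=True  : FIFO hand-off, the claim runs as soon as the current
                 update releases the semaphore.
    fair=False : at each release the claim wins only with probability
                 p_claim_wins (a hypothetical stand-in for an unfair lock).
    """
    waited = 0.0
    for hold in hold_times:
        waited += hold  # wait for the update currently holding the lock
        if fair or rng.random() < p_claim_wins:
            return waited
    return waited  # the pass ended, so the claim finally gets the lock


rng = random.Random(0)
holds = [rng.uniform(2, 5) for _ in range(150)]  # 2-5 s per node, as reported

fair_wait = claim_wait(holds, fair=True, rng=rng)
unfair_wait = claim_wait(holds, fair=False, rng=rng)
print(f"fair lock: {fair_wait:.1f} s, unfair lock (toy model): {unfair_wait:.1f} s")
```

In this model a fair lock bounds the wait by a single node update, while the unfair variant can lose many hand-offs in a row, which is at least consistent with the sporadic, long delays described here.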

Expected result
===============
Instance proceeds to the networking phase of creation within 60 seconds.

Actual result
=============
Instance stuck in BUILD state for 30-60 minutes before proceeding to networking phase.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/
   Nova 20.0.1

2. Which hypervisor did you use?
   Ironic

3. Which storage type did you use?
   N/A

4. Which networking type did you use?
   Neutron/OVS

Logs & Configs
==============

Links
=====
First report, on openstack-discuss: http://lists.openstack.org/pipermail/openstack-discuss/2019-May/006192.html

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1864122

Title:
  Instances (bare metal) queue for long time when managing a large
  amount of Ironic nodes

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1864122/+subscriptions

