
yahoo-eng-team team mailing list archive

[Bug 1815791] Re: Race condition causes Nova to shut off a successfully deployed baremetal server

 

Reviewed:  https://review.openstack.org/636699
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=19cb8280232fd3b0ba0000a475d061ea9fb10e1a
Submitter: Zuul
Branch:    master

commit 19cb8280232fd3b0ba0000a475d061ea9fb10e1a
Author: Jim Rollenhagen <jim@xxxxxxxxxxxxxxxxxx>
Date:   Wed Feb 13 12:59:53 2019 -0500

    ironic: check fresh data when sync_power_state doesn't line up
    
    We return cached data to sync_power_state to avoid pummeling the ironic
    API. However, this can lead to a race condition where an instance is
    powered on, but nova thinks it should be off and calls stop(). Check
    again without the cache when this happens to make sure we don't
    unnecessarily kill an instance.
    
    Closes-Bug: #1815791
    Change-Id: I907b69eb689cf6c169a4869cfc7889308ca419d5
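
The approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the actual change: FakeIronicClient, CachedPowerDriver, the use_cache parameter, the ttl value, and this standalone sync_power_state function are all hypothetical names invented for the example. The only details taken from the report are the numeric power state 4 (SHUTDOWN in nova.compute.power_state) and the "active" vm_state seen in the log.

```python
import time

# Numeric power states as defined in nova.compute.power_state.
RUNNING = 1
SHUTDOWN = 4


class FakeIronicClient:
    """Hypothetical stand-in for the ironic API; counts how often it is hit."""

    def __init__(self, power_state):
        self.power_state = power_state
        self.calls = 0

    def get_power_state(self, node_id):
        self.calls += 1
        return self.power_state


class CachedPowerDriver:
    """Hypothetical driver that serves power state from a short-lived cache."""

    def __init__(self, client, ttl=60.0):
        self.client = client
        self.ttl = ttl  # assumed cache lifetime in seconds
        self._cache = {}  # node_id -> (power_state, fetched_at)

    def get_power_state(self, node_id, use_cache=True):
        entry = self._cache.get(node_id)
        if use_cache and entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        # Cache miss, expired entry, or caller asked for fresh data.
        state = self.client.get_power_state(node_id)
        self._cache[node_id] = (state, time.monotonic())
        return state


def sync_power_state(driver, node_id, vm_state):
    """Return "stop" only if a fresh read confirms the node is off."""
    observed = driver.get_power_state(node_id)
    if vm_state == "active" and observed == SHUTDOWN:
        # The cached value disagrees with the database and may be stale,
        # so ask the API again before doing anything destructive.
        observed = driver.get_power_state(node_id, use_cache=False)
        if observed == SHUTDOWN:
            return "stop"
    return "noop"


if __name__ == "__main__":
    client = FakeIronicClient(SHUTDOWN)
    driver = CachedPowerDriver(client)
    driver.get_power_state("node-1")  # caches SHUTDOWN during deploy
    client.power_state = RUNNING      # ironic powers the node on
    print(sync_power_state(driver, "node-1", "active"))
```

The extra API call happens only on a mismatch, so the common case still uses the cache and the ironic API is not queried on every sync.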


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1815791

Title:
  Race condition causes Nova to shut off a successfully deployed
  baremetal server

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  When booting a baremetal server with Nova, we see Ironic report a
  successful power on:

    ironic-conductor.log:2019-02-13 10:52:15.901 7 INFO ironic.conductor.utils [req-774350ce-9392-4096-b66c-20ad3d794e4e 7a9b1ac45e084e7cbeb9cb740ffe8d08 41ea8af8d00e46438c7be3b182bbb53f - default default] Successfully set node a00696d5-32ba-475e-9528-59bf11cffea6 power state to power on by power on.

  But less than two seconds later, Nova (a) triggers a power state sync
  and then (b) decides the node is in state "power off" and shuts it down:

  	nova-compute.log:2019-02-13 10:52:17.289 7 DEBUG nova.compute.manager [req-9bcae7d4-4201-40ea-a66c-c5954117f0e4 - - - - -] Triggering sync for uuid dcb4f055-cda4-4d61-ba8f-976645c4e92a _sync_power_states /usr/lib/python2.7/site-packages/nova/compute/manager.py:7516
  	nova-compute.log:2019-02-13 10:52:17.295 7 DEBUG oslo_concurrency.lockutils [-] Lock "dcb4f055-cda4-4d61-ba8f-976645c4e92a" acquired by "nova.compute.manager.query_driver_power_state_and_sync" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:327
  	nova-compute.log:2019-02-13 10:52:17.344 7 WARNING nova.compute.manager [-] [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 4, current VM power_state: 4
  	nova-compute.log:2019-02-13 10:52:17.345 7 DEBUG nova.compute.api [-] [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Going to try to stop instance force_stop /usr/lib/python2.7/site-packages/nova/compute/api.py:2291
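
  To put numbers on the window, the timestamps in the log lines above can be subtracted directly. This is only a rough sketch: it assumes the ironic-conductor and nova-compute clocks are synchronized, and gap_seconds is a helper name made up for this example.

```python
from datetime import datetime

# Timestamp layout used by the log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def gap_seconds(earlier, later):
    """Seconds elapsed between two log timestamps in the same timezone."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


power_on = "2019-02-13 10:52:15.901"        # ironic: power state set to on
sync_triggered = "2019-02-13 10:52:17.289"  # nova: "Triggering sync"
stop_called = "2019-02-13 10:52:17.345"     # nova: "Going to try to stop"

if __name__ == "__main__":
    print("power on to sync: %.3f s" % gap_seconds(power_on, sync_triggered))
    print("power on to stop: %.3f s" % gap_seconds(power_on, stop_called))
```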

  It looks like Nova is using stale cache data to make this decision.

  jroll on IRC suggests that a solution may look like
  https://review.openstack.org/#/c/636699/ (bypass the cache to verify
  the power state before shutting down the server).

  This is with nova @ ad842aa and ironic @ 4404292.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1815791/+subscriptions

