yahoo-eng-team team mailing list archive

[Bug 1999674] [NEW] nova compute service does not reset instance with task_state in rebooting_hard

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Pierre-Samuel LE STANG <1999674@xxxxxxxxxxxxxxxxxx>
Date: Wed, 14 Dec 2022 16:59:16 -0000
Reply-to: Bug 1999674 <1999674@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

Description
===========
When a user requests a hard reboot of a running instance while nova-compute is unavailable (service stopped or host down), the instance may, under certain conditions, remain in the rebooting_hard task_state after nova-compute starts again.

The issue occurs when the RabbitMQ message-ttl for queued messages is
shorter than the time needed to bring nova-compute back up, so the
reboot request expires before the service can consume it.
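
The timing condition above can be illustrated with a small model. This is only a sketch, not nova or RabbitMQ code: the function names are hypothetical, and it assumes the reboot request is published at the moment the service goes down and that task_state is already rebooting_hard at that point.

```python
from typing import Optional

REBOOTING_HARD = "rebooting_hard"


def message_expired(seconds_in_queue: float, message_ttl: float) -> bool:
    """A queued message is dropped once it has waited longer than its TTL."""
    return seconds_in_queue > message_ttl


def task_state_after_restart(downtime: float, message_ttl: float) -> Optional[str]:
    """Model the task_state seen once nova-compute is back after `downtime` seconds.

    Assumption: the reboot request is published at t=0, right when the
    service goes down, and task_state is already rebooting_hard by then.
    """
    task_state: Optional[str] = REBOOTING_HARD
    if not message_expired(downtime, message_ttl):
        # The request is still queued: the service consumes it, performs
        # the reboot, and the task_state is cleared.
        task_state = None
    # Otherwise the request is gone and nothing ever clears the task_state.
    return task_state


if __name__ == "__main__":
    print(task_state_after_restart(downtime=30, message_ttl=60))
    print(task_state_after_restart(downtime=120, message_ttl=60))
```

With a 60 second TTL, a 30 second outage leaves the request in the queue, while a 120 second outage loses it and the model keeps reporting rebooting_hard.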


Steps to reproduce
==================

Prerequisites:
* Set a low message-ttl (for example, 60 seconds) in RabbitMQ
* Have a running instance on a host

First case: failure of the nova-compute service
1/ stop the nova-compute service on the host
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait at least 60 seconds (longer than the message-ttl)
4/ start the nova-compute service
5/ check the instance task_state and status

Second case: failure of the host
1/ hard shut down the host (for example, a power supply issue)
2/ request a hard reboot: openstack server reboot --hard <instance_id>
3/ wait at least 60 seconds (longer than the message-ttl)
4/ restart the host
5/ check the instance task_state and status


Expected result
===============
We expect nova-compute to reset the instance state to active on startup, since the reboot message was lost, so that the user can take other actions on the instance.
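
As a rough illustration of the expected behaviour, the sketch below resets instances found in rebooting_hard when the service starts. It is not nova's code: the function name, the dict layout and the "uuid" and "vm_state" keys are hypothetical, and it assumes that any instance still in rebooting_hard at startup never had its reboot request delivered. A real fix would have to make sure a reboot is not actually in progress before resetting anything.

```python
from typing import Any, Dict, List

REBOOTING_HARD = "rebooting_hard"


def reset_stuck_hard_reboots(instances: List[Dict[str, Any]]) -> List[str]:
    """Clear a stale rebooting_hard task_state on each instance of this host.

    Returns the identifiers of the instances that were reset.
    Assumption: called once at service startup, before any new
    request is consumed from the queue.
    """
    reset_ids: List[str] = []
    for instance in instances:
        if instance.get("task_state") == REBOOTING_HARD:
            # The reboot request was lost, so no worker will ever
            # clear this task_state; do it here instead.
            instance["task_state"] = None
            instance["vm_state"] = "active"
            reset_ids.append(instance["uuid"])
    return reset_ids


if __name__ == "__main__":
    host_instances = [
        {"uuid": "a", "vm_state": "active", "task_state": REBOOTING_HARD},
        {"uuid": "b", "vm_state": "active", "task_state": None},
    ]
    print(reset_stuck_hard_reboots(host_instances))
```

Only the instance left in rebooting_hard is touched; instances with no pending task are left alone.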

Actual result
=============
The instance is stuck in the rebooting_hard task_state and the user is blocked.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1999674

Title:
  nova compute service does not reset instance with task_state in
  rebooting_hard

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1999674/+subscriptions

Follow ups

[Bug 1999674] Re: nova compute service does not reset instance with task_state in rebooting_hard
From: OpenStack Infra, 2024-03-20