yahoo-eng-team team mailing list archive

[Bug 2091147] [NEW] Nova prematurely interprets 'in shutdown' state as 'shutdown successful' for VMs with PCI passthrough devices, hindering graceful shutdown

 

Public bug reported:

Nova's instance shutdown logic prematurely interprets the 'in shutdown'
state as 'shutdown successful', interfering with the graceful shutdown
process and potentially causing issues.

When a stop command is issued for VMs using PCI passthrough (e.g., GPUs), the shutdown process can take considerably longer than for traditional VMs:
- 4 GPU, 1TB memory VM: ~1 minute 20 seconds for shutdown
- 8 GPU, 2TB memory VM: ~2 minutes 10 seconds for shutdown

Nova treats the 'in shutdown' state, in which the shutdown is still in progress, as 'shutdown successful'. As a result, the graceful shutdown logic does not run to completion, and a destroy attempt can be triggered before the guest has finished shutting down. This can result in errors such as:
  "Cannot destroy instance, general system call failure: libvirt.libvirtError: Failed to terminate process 1910551 with SIGKILL: Device or resource busy"
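
The distinction between the two states can be illustrated with a minimal sketch. This is not Nova's code: wait_for_shutoff and its parameters are hypothetical names chosen for illustration. The only facts assumed are the numeric values of libvirt's virDomainState enum, which are copied into the snippet so that it runs without libvirt installed. The loop counts only the 'shut off' state as success and treats 'in shutdown' as still in progress until the timeout expires.

```python
import time

# Numeric values of libvirt's virDomainState enum, copied here so the
# sketch runs without the libvirt bindings installed.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_SHUTDOWN = 4  # domain is being shut down (still in progress)
VIR_DOMAIN_SHUTOFF = 5   # domain is fully powered off


def wait_for_shutoff(get_state, timeout, poll_interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Hypothetical helper: poll until the domain is really shut off.

    get_state is any callable returning the current virDomainState value.
    Returns True only when VIR_DOMAIN_SHUTOFF is observed; returns False
    if the timeout expires first. VIR_DOMAIN_SHUTDOWN is never treated
    as success.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == VIR_DOMAIN_SHUTOFF:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With the libvirt Python bindings, where virDomain.state() returns a [state, reason] pair, get_state could be something like lambda: dom.state()[0]. A caller would fall back to a forced destroy only when the function returns False, so the configured timeout is actually honored for slow guests.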

This behavior prevents the effective use of the shutdown_timeout and os_shutdown_timeout settings, which are designed to allow for graceful shutdowns. By misinterpreting the 'in shutdown' state, Nova may initiate destroy operations too early, which risks data integrity issues and abnormal terminations.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2091147

Title:
  Nova prematurely interprets 'in shutdown' state as 'shutdown
  successful' for VMs with PCI passthrough devices, hindering graceful
  shutdown

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2091147/+subscriptions