
yahoo-eng-team team mailing list archive

[Bug 2065927] Re: cpu power management can fail with OSError: [Errno 16] Device or resource busy

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/920203
Committed: https://opendev.org/openstack/nova/commit/44c1b48b3121682cf959c90b3adaf2a3f92e318c
Submitter: "Zuul (22348)"
Branch:    master

commit 44c1b48b3121682cf959c90b3adaf2a3f92e318c
Author: Sean Mooney <work@xxxxxxxxxxxxxxx>
Date:   Wed May 22 18:59:02 2024 +0100

    retry write_sys call on device busy
    
    This change adds a retry_if_busy decorator
    to the read_sys and write_sys functions in the filesystem
    module that will retry reads and writes up to 5 times with
    a linear backoff.
    
    This allows nova to tolerate short periods of time where
    sysfs returns device busy. If the retries are exhausted
    and offlining a core fails, a warning is logged and the failure is
    ignored. Onlining a core is always treated as a hard error if
    retries are exhausted.
    
    Closes-Bug: #2065927
    Change-Id: I2a6a9f243cb403167620405e167a8dd2bbf3fa79
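
The retry behaviour described in the commit message can be sketched roughly as follows. This is an illustrative sketch, not the code from the patch: the constant names RETRY_LIMIT and BACKOFF_STEP, the one-second step, the reading of "up to 5 times" as five total attempts, and the simplified write_sys body are all assumptions.

```python
import errno
import functools
import time

RETRY_LIMIT = 5      # assumed to mean five attempts in total
BACKOFF_STEP = 1.0   # seconds; assumed value, not taken from the patch


def retry_if_busy(func):
    """Retry func when it raises OSError with errno EBUSY."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(1, RETRY_LIMIT + 1):
            try:
                return func(*args, **kwargs)
            except OSError as exc:
                # Only "device busy" is retried. Any other error, or
                # EBUSY on the final attempt, propagates to the caller.
                if exc.errno != errno.EBUSY or attempt == RETRY_LIMIT:
                    raise
                # Linear backoff: wait 1 step, then 2 steps, and so on.
                time.sleep(attempt * BACKOFF_STEP)
    return wrapper


@retry_if_busy
def write_sys(path, data):
    """Write a string to a sysfs-style file (simplified stand-in)."""
    with open(path, "w") as fd:
        fd.write(data)
```

With these assumed values a call that stays busy sleeps for 1, 2, 3 and 4 seconds between attempts before the final OSError is raised.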


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2065927

Title:
  cpu power management can fail with OSError: [Errno 16] Device or
  resource busy

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  Triaged
Status in OpenStack Compute (nova) antelope series:
  Triaged
Status in OpenStack Compute (nova) bobcat series:
  Triaged

Bug description:
  as reported downstream in https://issues.redhat.com/browse/OSPRH-7103

  if you create a vm, reboot the host, start the vm, and finally
  delete it, the delete may fail:

  May 16 15:54:26 edpm-compute-0 nova_compute[3396]: Traceback (most recent call last):
  May 16 15:54:26 edpm-compute-0 nova_compute[3396]:   File "/usr/lib/python3.9/site-packages/nova/filesystem.py", line 57, in write_sys
  May 16 15:54:26 edpm-compute-0 nova_compute[3396]:     fd.write(data)
  May 16 15:54:26 edpm-compute-0 nova_compute[3396]: OSError: [Errno 16] Device or resource busy

  this prevents the VM from being deleted on the initial request, but it
  can then be deleted if you try again

  this race condition with the kernel is unlikely to happen and appears
  to be timing related.

  i.e. there is a short period of time where onlining or offlining of a
  CPU may not be possible

  
  to mitigate this, nova should retry the operation with a backoff and
  then eventually squash the error, allowing the vm to be deleted without
  failing if we can't offline the core.

  
  power management of the core should never block the vm delete or cause
  it to fail.
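
  The mitigation above treats the two directions differently: a core that cannot be offlined is only a lost power saving, while a core that cannot be onlined is unusable. A minimal sketch of that policy, under stated assumptions: set_core_state is a hypothetical helper, the write_sys argument stands for any callable taking (path, data) that has already done its own retries, and the relative sysfs path is illustrative.

```python
import errno
import logging

LOG = logging.getLogger(__name__)


def set_core_state(core_id, online, write_sys):
    """Bring a core online or offline; return True if the write succeeded.

    write_sys is a caller-supplied callable (path, data), assumed to
    have exhausted its own retries before raising OSError.
    """
    path = f"devices/system/cpu/cpu{core_id}/online"
    try:
        write_sys(path, "1" if online else "0")
    except OSError as exc:
        # Onlining failures, and any error other than EBUSY, stay fatal.
        if online or exc.errno != errno.EBUSY:
            raise
        # Offlining is best effort: log and carry on so that the
        # calling operation (for example a VM delete) is not failed.
        LOG.warning("Could not offline core %s: %s", core_id, exc)
        return False
    return True
```

  A caller that deletes a VM could ignore a False return value and proceed, which keeps power management from blocking the delete.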

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2065927/+subscriptions


