yahoo-eng-team team mailing list archive

[Bug 2085709] [NEW] update_available_resource task loads the process by 100% with a large number of instances

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Ivan Tkachuk <2085709@xxxxxxxxxxxxxxxxxx>
Date: Sun, 27 Oct 2024 14:10:11 -0000
Reply-to: Bug 2085709 <2085709@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Description
===========
When many instances (70 or more) are placed on one node, the nova-compute process periodically loads one CPU core to 100%, and all operations (restart, migration, etc.) start to take a very long time.
After some time, RabbitMQ starts dropping the connection to the node because it no longer receives heartbeats from it, and the node is marked as down in the hypervisor list.
Over time the situation gets worse, and the process freezes more and more often.
Restarting the process gives only a short-term improvement.
I found that this is caused by the update_available_resource task, which collects information on all instances.
When I disabled it by setting
update_resources_interval = -1
in the configuration, everything started working as it should: the CPU load is minimal and all operations complete quickly.
The nova-compute process runs in a single thread, so when it has many simultaneous tasks collecting information from instances, it uses the entire core and freezes.
The host itself has enough processor resources; it is not even 50% loaded.
Screenshot from top - https://imgur.com/JXcDhS8
Here is an example of nova-compute processor usage before and after disabling the update_available_resource task - https://imgur.com/qqkhNla
I think this task needs to run in a separate thread so that it does not affect the service when there are a lot of instances.
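
The blocking effect described above can be illustrated with a minimal sketch. It is not nova code: asyncio stands in here for the cooperative, eventlet-style scheduling that nova-compute relies on, and collect_instance_stats, heartbeat and run are hypothetical names. The per-instance work is simulated with a blocking time.sleep. The sketch counts how many heartbeats fire while the collection runs inline compared with running it in a worker thread.

```python
import asyncio
import time


def collect_instance_stats(n_instances, per_instance_s=0.002):
    """Hypothetical stand-in for the resource update: blocks its caller."""
    for _ in range(n_instances):
        time.sleep(per_instance_s)  # blocking call that never yields to the loop


async def heartbeat(ticks, interval_s=0.01):
    """Record a timestamp every interval_s seconds, like a broker heartbeat."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval_s)


async def run(offload, n_instances=100):
    """Return the number of heartbeats sent while the collection was running."""
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat send its first tick
    if offload:
        # Worker thread: the event loop stays free to run the heartbeat.
        await asyncio.to_thread(collect_instance_stats, n_instances)
    else:
        # Inline: the event loop is stuck until the whole collection finishes.
        collect_instance_stats(n_instances)
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return len(ticks)


if __name__ == "__main__":
    print("heartbeats, inline collection:  ", asyncio.run(run(offload=False)))
    print("heartbeats, threaded collection:", asyncio.run(run(offload=True)))
```

The thread helps in this sketch because time.sleep releases Python's global interpreter lock. Genuinely CPU-bound Python code would still compete for that lock, so in that case a separate process may be a better fit than a thread.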

Steps to reproduce
==================
Create a flavor small enough to fit 100 instances on the node, then create at least 100 instances:
openstack flavor create --public m1.extra_tiny --id auto --ram 512 --disk 15 --vcpus 1 
openstack server create --image 618ed5d4-f692-4ce3-af96-542c8ae9926a --network cc50edc1-3435-4854-ae7e-8215568a4249 --flavor m1.extra_tiny  --min 100 --max 100 test-nova

Expected result
===============
Nova-compute continues to work and does not disconnect or freeze.

Actual result
=============
Some time after the instances are launched, nova-compute CPU usage periodically rises to 100% while the process collects information about the instances, and all operations take a long time until the task finishes.
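
To record these spikes as numbers rather than screenshots, the CPU time of the process can be sampled from /proc. The following is a minimal, Linux-only sketch; parse_cpu_seconds, cpu_percent and sample are hypothetical helper names, and the PID of the nova-compute process has to be supplied by the operator.

```python
import os
import time


def parse_cpu_seconds(stat_line, clk_tck):
    """Return user + system CPU seconds from one /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces, so split after the last ')'.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of the file
    return (utime + stime) / clk_tck


def cpu_percent(prev_cpu_s, cur_cpu_s, elapsed_s):
    """CPU usage over the interval, where 100.0 means one fully busy core."""
    return 100.0 * (cur_cpu_s - prev_cpu_s) / elapsed_s


def sample(pid, interval_s=1.0, count=60):
    """Print one CPU usage reading per interval for the given PID (Linux only)."""
    clk_tck = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"

    def read():
        with open(path) as f:
            return parse_cpu_seconds(f.read(), clk_tck), time.monotonic()

    prev_cpu, prev_t = read()
    for _ in range(count):
        time.sleep(interval_s)
        cur_cpu, cur_t = read()
        print(f"{cpu_percent(prev_cpu, cur_cpu, cur_t - prev_t):6.1f}%")
        prev_cpu, prev_t = cur_cpu, cur_t
```

Calling sample() with the nova-compute PID while the instances are running should show whether readings close to 100% recur at the resource update interval.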

Environment
===========
OpenStack release - 2023.1
Nova-compute - 27.5.1
Hypervisor - Libvirt + KVM
Storage type - VM files are located on node disks with an ext4 file system
CPU - 2 x Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Networking - Neutron with Open vSwitch

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: performance

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085709

Title:
  update_available_resource task loads the process by 100% with a large
  number of instances

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085709/+subscriptions