yahoo-eng-team team mailing list archive

[Bug 2095460] [NEW] nova api needs too long for server list

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Max Friedrich <2095460@xxxxxxxxxxxxxxxxxx>
Date: Wed, 22 Jan 2025 06:35:34 -0000
Reply-to: Bug 2095460 <2095460@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Description
===========
The commands openstack server list and nova list (i.e., all commands that query the API endpoint /v2.1/servers/detail) take a very long time to respond.
My setup is deployed using kolla-ansible and consists of 3 nodes running nova-api, superconductor, and scheduler; 3 nodes with nova-conductor (cell 1); and 18 compute nodes.
The nova version is 29.2.1. I have already found other threads suggesting an issue with memcached, but none of the proposed solutions have helped.

For a project with around 500 instances, the first execution of
openstack server list results in a 504 gateway timeout (this comes from
haproxy, as the request takes more than 60 seconds). Subsequent requests
take between 48 and 52 seconds. These were all performed using a
project-specific user.

Interestingly, when using an admin token (openstack server list
--project %id%), the query completes in "only" 11-15 seconds. I have set
the nova log level to debug but found nothing noteworthy.

The connections to mariadb are not saturated, and the database shows no
signs of overload.

What other steps can I take to investigate the root cause of this issue?

Steps to reproduce
==================
- project with many instances (response times are also significantly too high in my opinion for smaller projects with 100-200 instances)
- openstack server list (within the project context); very long response time
- openstack server list --project %id%; significantly shorter response time
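
The two calls above can be timed side by side with a small script. This is only a sketch: time_command and compare_listings are hypothetical helper names, and it assumes the openstack CLI is on the PATH with credentials for the relevant user already loaded in the environment.

```python
import subprocess
import time


def time_command(argv):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True)
    return time.perf_counter() - start, proc.returncode


def compare_listings(project_id):
    """Time the project-scoped call and the admin call with --project.

    Assumption: the caller switches credentials between the two runs as
    needed; this function only measures wall-clock time per command.
    """
    commands = {
        "project user": ["openstack", "server", "list"],
        "admin --project": ["openstack", "server", "list",
                            "--project", project_id],
    }
    results = {}
    for label, argv in commands.items():
        results[label] = time_command(argv)
    return results
```

Calling compare_listings with the project ID returns the elapsed time and exit code for each variant, which makes repeated runs easier to compare than reading timestamps by hand.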

Expected result
===============
The response times should be below 10 seconds.

Actual result
=============
See above.

Environment
===========
1. kolla-ansible multinode setup with 2024.1
   nova 29.2.1
2. Libvirt + KVM
3. Ceph 18.2.4
4. Neutron with OVN

Logs & Configs
==============
I have attached a debug log file covering all 3 API nodes. I replaced the original names with "URI" for the URL and "host 0[1-3]" for the host names.

The request id cfd08275-3808-4969-b76f-cc63be316155 is from a project specific user calling openstack server list.
The request id b4e36393-7fb9-4962-9610-bdaa6196b4ad is from the admin-specific call.

The nova-api config is kolla standard. I have only changed the number of workers as follows:
nova_api_workers: "{{ [ansible_facts.processor_vcpus // 2, 32] | min }}"
nova_superconductor_workers: "{{ [ansible_facts.processor_vcpus // 2, 32] | min }}"
nova_cell_conductor_workers: "{{ [ansible_facts.processor_vcpus // 2, 32] | min }}"
nova_scheduler_workers: "{{ [ansible_facts.processor_vcpus // 4, 16] | min }}"

This results in 24 worker processes for nova-api.
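
As a rough cross-check of that number, the Jinja expressions above can be mirrored in Python. This is a sketch only: the value of 48 vCPUs is an assumption inferred from the 24 workers (it is not read from the hosts), and capped_workers is a hypothetical helper name.

```python
def capped_workers(vcpus, divisor, cap):
    """Mirror of "{{ [vcpus // divisor, cap] | min }}" from the config."""
    return min(vcpus // divisor, cap)


# Assumption: 48 vCPUs per API node, inferred from 24 nova-api workers.
vcpus = 48

api_workers = capped_workers(vcpus, 2, 32)        # nova_api_workers
scheduler_workers = capped_workers(vcpus, 4, 16)  # nova_scheduler_workers

print(api_workers, scheduler_workers)
```

Under that assumption the cap of 32 is never reached for nova-api, so the worker count is determined by the vCPU count alone.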

Thanks a lot for your efforts. If there are any further questions or if
more/different information is needed, I will try to provide it as soon
as possible.

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "nova-api-anonymous.log"
   https://bugs.launchpad.net/bugs/2095460/+attachment/5853344/+files/nova-api-anonymous.log

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2095460

Title:
  nova api needs too long for server list

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2095460/+subscriptions