yahoo-eng-team team mailing list archive

[Bug 2121607] [NEW] Nova-api showing latency after upgrading to Caracal

 

Public bug reported:

After upgrading to Caracal, we noticed that the duration of GET calls to
nova-api increases over time, and so does the memory usage of nova-api.
We first noticed this in telegraf metrics. To validate it, I created a
brand new cluster of VMs without telegraf, with only one headnode
running nova-api, had multiple nodes send GET requests to it, and
monitored the duration.

Script to send requests:
# --- Get a fresh token (requires openrc sourced first) ---
get_token() {
  openstack token issue -f value -c id
}
OS_TOKEN=$(get_token)
echo "Using token: $OS_TOKEN"

# --- Send requests at up to 10 per second ---
# NOVA_URL must point at the compute API endpoint before running this.
COUNT=0
while true; do
  COUNT=$((COUNT+1))
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-Auth-Token: $OS_TOKEN" \
    -H "Accept: application/json" \
    "$NOVA_URL/servers/detail")

  echo "$(date +'%F %T') [$COUNT] HTTP $STATUS"

  if [ "$STATUS" = "401" ]; then
    echo "[$(date)] Got 401 → refreshing token..."
    OS_TOKEN=$(get_token)
    continue   # retry next loop with fresh token
  fi

  sleep 0.1   # 0.1 sec pause → at most 10 requests per second
done
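
The loop above records only the HTTP status. As a minimal sketch,
assuming the same OS_TOKEN and NOVA_URL variables, the client-side
duration can be captured too with curl's time_total write-out variable.
The helper name timed_get is hypothetical and not part of the original
script:

```shell
# Print "timestamp,http_code,seconds" for one GET to the URL in $1.
# Assumes OS_TOKEN holds a valid token, as in the script above.
timed_get() {
  local result
  result=$(curl -s -o /dev/null -w '%{http_code},%{time_total}' \
    -H "X-Auth-Token: $OS_TOKEN" \
    -H "Accept: application/json" \
    "$1")
  echo "$(date +'%F %T'),$result"
}

# Usage (hypothetical output file name):
#   timed_get "$NOVA_URL/servers/detail" >> client-latency.csv
```

Comparing this client-side number with the time: value in nova-api.log
may help tell apart latency inside nova-api from latency added on the
network path.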

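Script to monitor the duration (average per 5 minutes, requires gawk):
grep 'servers/detail' /var/log/nova/nova-api.log | awk '
  # The three-argument match() used below is a gawk extension.
  # Example line (shape only; any request line looks like this):
  # 2025-08-21 17:27:08.859 ... "GET /v2.1/os-quota-sets/..." ... time: 0.6598654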
  match($0, /^([0-9-]+) ([0-9]{2}):([0-9]{2}):([0-9]{2})(\.[0-9]+)?.* time: ([0-9.]+)/, m) {
      ymd = m[1]; hh = m[2]; mm = m[3]; dur = m[6]
      bmin = int(mm/5)*5                           # floor minute to 5-min bucket
      key = sprintf("%s %s:%02d", ymd, hh, bmin)   # e.g., 2025-08-21 17:25
      sum[key] += dur; cnt[key]++
  }
  END {
      for (k in sum) printf "%s,%.3f\n", k, sum[k]/cnt[k] | "sort"
  }'
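
The script above uses the three-argument form of match(), which is a
gawk extension. On hosts where awk is mawk or another POSIX awk, the
same 5-minute bucketing can be sketched as below. It assumes the log
format of the example line (date in the first field, HH:MM:SS in the
second, and a " time: <seconds>" field later in the line). The function
name bucket_avg is hypothetical:

```shell
# Read nova-api log lines on stdin, print "bucket,average_seconds".
bucket_avg() {
  awk '
    # Only lines that start with a timestamp and carry a " time: " field.
    /^[0-9-]+ [0-9][0-9]:[0-9][0-9]:/ && / time: [0-9.]+/ {
        split($2, t, ":")                 # $2 is HH:MM:SS(.mmm)
        bmin = int(t[2] / 5) * 5          # floor minute to 5-min bucket
        key = sprintf("%s %s:%02d", $1, t[1], bmin)
        dur = $0
        sub(/.* time: /, "", dur)         # keep what follows " time: "
        sum[key] += dur; cnt[key]++       # awk uses the numeric prefix of dur
    }
    END {
        for (k in sum) printf "%s,%.3f\n", k, sum[k] / cnt[k]
    }' | sort
}

# Usage:
#   grep 'servers/detail' /var/log/nova/nova-api.log | bucket_avg
```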

I use systemctl status to track the memory usage; it grew by about
500MB over a weekend (I'm testing on a small cluster). The duration of
the GET requests also increased noticeably, and there seems to be no
upper bound.
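
For reference, the memory tracking can be made repeatable instead of
reading systemctl status by hand. This is a minimal sketch, assuming
the unit is named nova-api.service (the name differs between
distributions) and that systemd memory accounting is enabled for it.
The function names are hypothetical:

```shell
# Assumption: unit name; override with UNIT=... for your distribution.
UNIT="${UNIT:-nova-api.service}"

# Convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Print one CSV sample: "timestamp,memory_in_MiB".
# MemoryCurrent is the cgroup memory usage systemd reports for the unit.
sample_memory() {
  local bytes
  bytes=$(systemctl show "$UNIT" -p MemoryCurrent --value)
  case "$bytes" in
    # Non-numeric output, e.g. when memory accounting is unavailable.
    ''|*[!0-9]*) echo "$(date +'%F %T'),NA" ;;
    *)           echo "$(date +'%F %T'),$(bytes_to_mib "$bytes")" ;;
  esac
}

# Usage (one sample per minute, hypothetical output file name):
#   while true; do sample_memory >> nova-api-mem.csv; sleep 60; done
```

The resulting CSV can be lined up with the 5-minute duration averages
to see whether latency and memory grow together.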

I suspect this is a memory leak, but would like confirmation from the
team. Thanks.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2121607
