← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1834048] [NEW] Nova waits indefinitely on ceph client hangs due to network problems

 

Public bug reported:

Description
===========
Requested to be filed by sean-k-mooney as "not a ceph problem".

During what looks like the update_available_resource process, queries to
ceph are made to check available space, etc. In cases where there is
packet loss between the compute node and ceph, the ceph client may hang
for up to 30 seconds per dropped request.

This freezes up nova's queue and enough sequential failures will
eventually shows up with a symptom of "too many missed heartbeats"
rabbitmq error, which interrupts and restarts the cycle over again.

As suggested by Sean, it might be best to put a configurable timeout on
ceph calls during this process to ensure nova doesnt lock up/flap, and
ceph backend network issues are reported for debug.

Steps to reproduce
==================
1. introduce a silent failure of ceph client, oneway packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
2. observe symptom of nova hanging long enough to miss 60 seconds of rabbitmq heartbeats, debug hanging on update_available_resource /var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704

Expected result
===============
nova alerting of ceph connection timeout

Actual result
=============
nova hangs for 60 seconds, while being in "up" state, flapping for a couple seconds every 60 seconds as it hits the rabbitmq error and reconnects, but is in non-functional state and ignores all instructions on the messagebus.

Environment
===========
nova==18.1.0
rocky

Logs & Configs
==============
No direct logs other than rabbitmq's complaints of timeouts as a symptom.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834048

Title:
  Nova waits indefinitely on ceph client hangs due to network problems

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  Requested to be filed by sean-k-mooney as "not a ceph problem".

  During what looks like the update_available_resource process, queries
  to ceph are made to check available space, etc. In cases where there
  is packet loss between the compute node and ceph, the ceph client may
  hang for up to 30 seconds per dropped request.

  This freezes up nova's queue and enough sequential failures will
  eventually shows up with a symptom of "too many missed heartbeats"
  rabbitmq error, which interrupts and restarts the cycle over again.

  As suggested by Sean, it might be best to put a configurable timeout
  on ceph calls during this process to ensure nova doesnt lock up/flap,
  and ceph backend network issues are reported for debug.

  Steps to reproduce
  ==================
  1. introduce a silent failure of ceph client, oneway packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
  2. observe symptom of nova hanging long enough to miss 60 seconds of rabbitmq heartbeats, debug hanging on update_available_resource /var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704

  Expected result
  ===============
  nova alerting of ceph connection timeout

  Actual result
  =============
  nova hangs for 60 seconds, while being in "up" state, flapping for a couple seconds every 60 seconds as it hits the rabbitmq error and reconnects, but is in non-functional state and ignores all instructions on the messagebus.

  Environment
  ===========
  nova==18.1.0
  rocky

  Logs & Configs
  ==============
  No direct logs other than rabbitmq's complaints of timeouts as a symptom.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834048/+subscriptions


Follow ups