yahoo-eng-team team mailing list archive
Message #79162
[Bug 1834048] Re: Nova waits indefinitely on ceph client hangs due to network problems
Reviewed: https://review.opendev.org/667421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Submitter: Zuul
Branch: master
commit 03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Author: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date: Tue Jun 25 18:20:24 2019 +0100
libvirt: Add a rbd_connect_timeout configurable
Previously, the initial call to connect to an RBD cluster via the RADOS
API could hang indefinitely when network or other environment-related
issues were encountered.
When encountered during a call to update_available_resource, this can
result in the local n-cpu service reporting as UP while never being able
to break out of a subsequent RPC timeout loop, as documented in this bug.
This change adds a simple configurable timeout to be used when initially
connecting to the cluster [1][2][3]. The default timeout of 5 seconds is
small enough to ensure that, if a hang is encountered, the n-cpu service
can be marked as DOWN before an RPC timeout is seen.
[1] http://docs.ceph.com/docs/luminous/rados/api/python/#rados.Rados.connect
[2] http://docs.ceph.com/docs/mimic/rados/api/python/#rados.Rados.connect
[3] http://docs.ceph.com/docs/nautilus/rados/api/python/#rados.Rados.connect
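The fix passes the timeout through to rados.Rados.connect(), which raises once the deadline passes rather than blocking forever. As a generic illustration of the pattern only (not nova's actual code), a potentially hanging connect call can be bounded with a worker thread; connect_fn here is a hypothetical stand-in for the real client call:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def connect_with_timeout(connect_fn, timeout):
    """Call a potentially hanging connect_fn, bounded by timeout seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(connect_fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # Surface the failure instead of blocking the caller indefinitely.
        raise RuntimeError("connection attempt timed out after %ss" % timeout)
    finally:
        # Do not wait for a hung worker thread to finish.
        pool.shutdown(wait=False)
```

This mirrors the effect of the new option: a hung connection attempt turns into a prompt, visible failure instead of a silent stall.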
Closes-bug: #1834048
Change-Id: I67f341bf895d6cc5d503da274c089d443295199e
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834048
Title:
Nova waits indefinitely on ceph client hangs due to network problems
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Description
===========
Requested to be filed by sean-k-mooney as "not a ceph problem".
During what looks like the update_available_resource process, queries
to ceph are made to check available space, etc. In cases where there
is packet loss between the compute node and ceph, the ceph client may
hang for up to 30 seconds per dropped request.
This freezes up nova's queue, and enough sequential failures eventually
surface as a "too many missed heartbeats" rabbitmq error, which
interrupts the cycle and starts it over again.
As suggested by Sean, it might be best to put a configurable timeout
on ceph calls during this process to ensure nova doesn't lock up or flap,
and so that ceph backend network issues are reported for debugging.
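The merged change above implements exactly this as a rbd_connect_timeout option in the [libvirt] group, defaulting to 5 seconds. A sketch of the corresponding nova.conf setting:

```ini
[libvirt]
# Seconds to wait when initially connecting to the RBD cluster
# before giving up (default: 5).
rbd_connect_timeout = 5
```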
Steps to reproduce
==================
1. Introduce a silent failure of the ceph client: one-way packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
2. Observe nova hanging long enough to miss 60 seconds of rabbitmq heartbeats; debugging shows it hung in update_available_resource at /var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704
Expected result
===============
nova alerting of ceph connection timeout
Actual result
=============
nova hangs for 60 seconds while reporting as "up", flapping for a couple of seconds every 60 seconds as it hits the rabbitmq error and reconnects; it is in a non-functional state and ignores all instructions on the message bus.
Environment
===========
nova==18.1.0
rocky
Logs & Configs
==============
No direct logs other than rabbitmq's complaints of timeouts as a symptom.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834048/+subscriptions