yahoo-eng-team team mailing list archive

[Bug 1834048] Re: Nova waits indefinitely on ceph client hangs due to network problems

 

Reviewed:  https://review.opendev.org/667421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Submitter: Zuul
Branch:    master

commit 03f7dc29b75d1099ef44a034ed7e23d2a4444ac6
Author: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date:   Tue Jun 25 18:20:24 2019 +0100

    libvirt: Add a rbd_connect_timeout configurable
    
    Previously the initial call to connect to a RBD cluster via the RADOS
    API could hang indefinitely if network or other environmental related
    issues were encountered.
    
    When encountered during a call to update_available_resource this can
    result in the local n-cpu service reporting as UP while never being able
    to break out of a subsequent RPC timeout loop as documented in bug
    
    This change adds a simple timeout configurable to be used when initially
    connecting to the cluster [1][2][3]. The default timeout of 5 seconds
    is small enough to ensure that, if the hang is encountered, the n-cpu
    service can be marked as DOWN before an RPC timeout is seen.
    
    [1] http://docs.ceph.com/docs/luminous/rados/api/python/#rados.Rados.connect
    [2] http://docs.ceph.com/docs/mimic/rados/api/python/#rados.Rados.connect
    [3] http://docs.ceph.com/docs/nautilus/rados/api/python/#rados.Rados.connect
    
    Closes-bug: #1834048
    Change-Id: I67f341bf895d6cc5d503da274c089d443295199e
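
The change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: FakeRados is a hypothetical stand-in for rados.Rados so the example runs without a Ceph cluster, connect_cluster and RBD_CONNECT_TIMEOUT are names invented here, and the use of the builtin TimeoutError for a failed connect is an assumption. Only the timeout keyword to connect() and the 5 second default come from the commit message and the linked documentation.

```python
class FakeRados:
    """Hypothetical stand-in for rados.Rados (assumption, not the real binding)."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.seen_timeout = None
        self.connected = False

    def connect(self, timeout=0):
        # Record the timeout that the caller asked for.
        self.seen_timeout = timeout
        if not self.reachable:
            # Assumption: an unreachable cluster surfaces as an exception
            # once the timeout expires; the builtin TimeoutError is used
            # here purely for illustration.
            raise TimeoutError("no answer from cluster within %ss" % timeout)
        self.connected = True


# Default taken from the commit message (5 seconds); the constant name is
# illustrative only.
RBD_CONNECT_TIMEOUT = 5


def connect_cluster(client, timeout=RBD_CONNECT_TIMEOUT):
    """Connect with a bounded wait; return True on success, False on timeout."""
    try:
        client.connect(timeout=timeout)
    except TimeoutError:
        return False
    return True
```

With a bounded connect, a caller such as a periodic resource update can give up and report the failure instead of blocking forever on an unreachable cluster.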


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1834048

Title:
  Nova waits indefinitely on ceph client hangs due to network problems

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  Requested to be filed by sean-k-mooney as "not a ceph problem".

  During what looks like the update_available_resource process, queries
  to ceph are made to check available space, etc. In cases where there
  is packet loss between the compute node and ceph, the ceph client may
  hang for up to 30 seconds per dropped request.

  This freezes up nova's queue, and enough sequential failures
  eventually show up as a "too many missed heartbeats" rabbitmq
  error, which interrupts and restarts the cycle.

  As suggested by Sean, it might be best to put a configurable timeout
  on ceph calls during this process to ensure nova doesn't lock up/flap,
  and ceph backend network issues are reported for debug.
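
One generic way to put a configurable timeout around a blocking call and report the failure is sketched below. This is an assumption-laden illustration, not how nova implements it: call_with_timeout and CephCallTimeout are hypothetical names, and a plain worker thread is used only to keep the example self-contained.

```python
import concurrent.futures
import logging

LOG = logging.getLogger(__name__)


class CephCallTimeout(Exception):
    """Hypothetical error type for a backend call that took too long."""


def call_with_timeout(func, timeout, *args):
    """Run func(*args) in a worker thread and wait at most `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Report the backend problem instead of hanging silently.
        LOG.warning("call %r did not finish within %ss", func, timeout)
        raise CephCallTimeout("timed out after %ss" % timeout) from None
    finally:
        # Do not block on the worker; note that the abandoned call keeps
        # running in its thread until it returns on its own.
        pool.shutdown(wait=False)
```

The caller regains control after the timeout, but the underlying call is not cancelled, so a timeout supported by the client library itself is preferable where one exists.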

  Steps to reproduce
  ==================
  1. Introduce a silent failure of the ceph client: one-way packet loss via mismatched LACP MTU across switches, bad triangular routing, flapping links, etc.
  2. Observe the symptom: nova hangs long enough to miss 60 seconds of rabbitmq heartbeats, with debugging showing the hang in update_available_resource (/var/lib/kolla/venv/local/lib/python2.7/site-packages/nova/compute/resource_tracker.py:704)

  Expected result
  ===============
  nova alerting of ceph connection timeout

  Actual result
  =============
  nova hangs for 60 seconds while reporting an "up" state, flapping for a couple of seconds every 60 seconds as it hits the rabbitmq error and reconnects. Throughout, it is in a non-functional state and ignores all instructions on the message bus.

  Environment
  ===========
  nova==18.1.0
  rocky

  Logs & Configs
  ==============
  No direct logs other than rabbitmq's complaints of timeouts as a symptom.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1834048/+subscriptions

