yahoo-eng-team team mailing list archive

[Bug 1742827] Re: nova-scheduler reports dead compute nodes but nova-compute is enabled and up

Reviewed:  https://review.openstack.org/533371
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c98ac6adc561d70d34c724703a437b8435e6ddfa
Submitter: Zuul
Branch:    master

commit c98ac6adc561d70d34c724703a437b8435e6ddfa
Author: melanie witt <melwittt@xxxxxxxxx>
Date:   Sat Jan 13 21:49:54 2018 +0000

    Stop globally caching host states in scheduler HostManager
    
    Currently, in the scheduler HostManager, we cache host states in
    a map global to all requests. This used to be okay because we were
    always querying the entire compute node list for every request to
    pass on to filtering. So we cached the host states globally and
    updated them per request and removed "dead nodes" from the cache
    (compute nodes still in the cache that were not returned from
    ComputeNodeList.get_all).
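
For illustration, here is a minimal, self-contained sketch of that
pre-Ocata pattern in Python. The class and attribute names
(HostManager, host_state_map) follow the nova source, but the bodies
are simplified stand-ins, not the actual nova code:

    import logging

    LOG = logging.getLogger(__name__)

    class HostState(object):
        """Simplified stand-in for nova's per-host state object."""
        def __init__(self, node):
            self.node = node

        def update(self, node):
            self.node = node

    class HostManager(object):
        def __init__(self):
            # One map shared by every scheduling request in the process.
            self.host_state_map = {}

        def get_host_states(self, compute_nodes):
            seen = set()
            for node in compute_nodes:
                seen.add(node)
                # Create or refresh the cached entry in place.
                self.host_state_map.setdefault(node, HostState(node))
                self.host_state_map[node].update(node)
            # Anything cached but not returned this time is assumed dead.
            # Safe only if compute_nodes is always the FULL node list.
            for dead in set(self.host_state_map) - seen:
                LOG.info("Removing dead compute node %s from scheduler",
                         dead)
                del self.host_state_map[dead]
            return (self.host_state_map[n] for n in seen)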
    
    As of Ocata, we started filtering our ComputeNodeList query based on
    an answer from placement about which resource providers could satisfy
    the request, instead of querying the entire compute node list every
    time. This is much more efficient (don't consider compute nodes that
    can't possibly fulfill the request) BUT it doesn't play well with the
    global host state cache. We started seeing "Removing dead compute node"
    messages in the logs, signaling removal of compute nodes from the
    global cache when compute nodes were actually available.
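
To see why the pre-filtered query matters: from Ocata on, the node
list handed to the cache logic is a per-request subset chosen by
placement, not the full list. A toy stand-in for that placement query
(node names and numbers invented for illustration):

    ALL_NODES = {'node1': 16384, 'node2': 8192, 'node3': 4096}  # free RAM, MB

    def placement_candidates(requested_ram_mb):
        # Toy stand-in for the placement allocation-candidates query.
        return [n for n, free in ALL_NODES.items()
                if free >= requested_ram_mb]

    print(placement_candidates(1024))   # ['node1', 'node2', 'node3']
    print(placement_candidates(12000))  # ['node1'] -- a per-request subset

With the pruning logic sketched above, the second request would evict
node2 and node3 from the shared cache even though both are alive.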
    
    If request A comes in and all compute nodes can satisfy it, and
    request B arrives concurrently with a request that no compute node
    can satisfy, request B will remove all of the compute nodes from
    the global host state cache. Request A then gets "no valid hosts"
    at the filtering stage, because get_host_states_by_uuids returns a
    generator that hands out hosts from the (now emptied) global host
    state cache.
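
The race is easy to reproduce with a toy version of the shared cache
(standalone Python, not nova code):

    shared_cache = {}

    def get_host_states(returned_nodes):
        for node in returned_nodes:
            shared_cache.setdefault(node, 'state-for-%s' % node)
        # Prune "dead" nodes: everything not in this request's result.
        for dead in set(shared_cache) - set(returned_nodes):
            del shared_cache[dead]   # logs "Removing dead compute node"
        # Lazy generator: cache lookups happen at iteration time.
        return (shared_cache[n] for n in returned_nodes
                if n in shared_cache)

    # Request A: placement says all three nodes qualify.
    hosts_for_a = get_host_states(['node1', 'node2', 'node3'])
    # Request B runs before A consumes its generator; placement found
    # no candidates for B, so B evicts every cached node.
    list(get_host_states([]))
    # Request A now sees zero hosts: "no valid hosts" at filtering.
    print(len(list(hosts_for_a)))   # 0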
    
    This removes the global host state cache from the scheduler HostManager
    and instead generates a fresh host state map per request and uses that
    to return hosts from the generator. Because we're filtering the
    ComputeNodeList based on a placement query per request, each request
    can have a completely different set of compute nodes that can fulfill
    it, so we're not gaining much by caching host states anyway.
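
A sketch of the fixed shape, again in toy form rather than the actual
nova code: the map is built per call, so one request can no longer
evict what another request is iterating over:

    def get_host_states(returned_nodes):
        # Fresh map per request; nothing outlives this call, so no
        # cross-request pruning (and no race) is needed.
        host_state_map = {n: 'state-for-%s' % n for n in returned_nodes}
        return (host_state_map[n] for n in returned_nodes)

    hosts_for_a = get_host_states(['node1', 'node2', 'node3'])
    list(get_host_states([]))       # request B touches nothing of A's
    print(len(list(hosts_for_a)))   # 3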
    
    Co-Authored-By: Dan Smith <dansmith@xxxxxxxxxx>
    
    Closes-Bug: #1742827
    Related-Bug: #1739323
    
    Change-Id: I40c17ed88f50ecbdedc4daf368fff10e90e7be11


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1742827

Title:
  nova-scheduler reports dead compute nodes but nova-compute is enabled
  and up

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  (originally reported by David Manchado in
  https://bugzilla.redhat.com/show_bug.cgi?id=1533196 )

  Description of problem:
  We are seeing that the nova scheduler is removing compute nodes
  because it considers them dead, but "openstack compute service list"
  reports nova-compute as up and running.
  We can see in nova-scheduler entries with the following pattern:
  - Removing dead compute node XXX from scheduler
  - Filter ComputeFilter returned 0 hosts
  - Filtering removed all hosts for the request with instance ID '11feeba9-f46c-416d-a97e-7c0c9d565b5a'. Filter results: ['AggregateInstanceExtraSpecsFilter: (start: 19, end: 2)', 'AggregateCoreFilter: (start: 2, end: 2)', 'AggregateDiskFilter: (start: 2, end: 2)', 'AggregateRamFilter: (start: 2, end: 2)', 'RetryFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 0)']

  Version-Release number of selected component (if applicable):
  Ocata

  How reproducible:
  N/A

  Actual results:
  Instances are not being spawned; they fail with 'no valid host
  found' because the scheduler has removed the compute nodes as dead.

  Additional info:
  This has been happening for a week.
  We did an upgrade from Newton three weeks ago.
  We have also done a minor update and the issue still persists.

  Nova related RPMs
  openstack-nova-scheduler-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  python2-novaclient-7.1.2-1.el7.noarch
  openstack-nova-novncproxy-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-cert-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-console-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-conductor-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-common-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-compute-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  openstack-nova-placement-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  puppet-nova-10.4.2-0.20180102233330.f4bc1f0.el7.centos.noarch
  openstack-nova-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
  python-nova-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1742827/+subscriptions

