[Bug 1742827] Re: nova-scheduler reports dead compute nodes but nova-compute is enabled and up
Reviewed: https://review.openstack.org/533371
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c98ac6adc561d70d34c724703a437b8435e6ddfa
Submitter: Zuul
Branch: master
commit c98ac6adc561d70d34c724703a437b8435e6ddfa
Author: melanie witt <melwittt@xxxxxxxxx>
Date: Sat Jan 13 21:49:54 2018 +0000
Stop globally caching host states in scheduler HostManager
Currently, in the scheduler HostManager, we cache host states in
a map global to all requests. This used to be okay because we were
always querying the entire compute node list for every request to
pass on to filtering. So we cached the host states globally and
updated them per request and removed "dead nodes" from the cache
(compute nodes still in the cache that were not returned from
ComputeNodeList.get_all).
As of Ocata, we started filtering our ComputeNodeList query based on
an answer from placement about which resource providers could satisfy
the request, instead of querying the entire compute node list every
time. This is much more efficient (don't consider compute nodes that
can't possibly fulfill the request) BUT it doesn't play well with the
global host state cache. We started seeing "Removing dead compute node"
messages in the logs, signaling removal of compute nodes from the
global cache when compute nodes were actually available.
If request A comes in and all compute nodes can satisfy it, and then
request B arrives concurrently and no compute nodes can satisfy it,
request B will remove all the compute nodes from the global host state
cache, and request A will then get "no valid hosts" at the filtering
stage because get_host_states_by_uuids returns a generator that hands
out hosts from the global host state cache.
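To make the race concrete, here is a minimal Python sketch. It is not
Nova's actual code; GLOBAL_HOST_CACHE and get_host_states_for_request
are hypothetical names used only to illustrate how pruning a cache
shared across requests can empty a concurrent request's view:

    GLOBAL_HOST_CACHE = {}  # uuid -> host state, shared by all requests


    def get_host_states_for_request(nodes_from_placement):
        """Update the shared cache and yield host states for this request."""
        seen = set()
        for node in nodes_from_placement:
            GLOBAL_HOST_CACHE[node['uuid']] = node
            seen.add(node['uuid'])
        # "Dead node" pruning: any cached node not returned for *this*
        # request is dropped, even if a concurrent request still needs it.
        for uuid in list(GLOBAL_HOST_CACHE):
            if uuid not in seen:
                del GLOBAL_HOST_CACHE[uuid]
        # The generator reads from the shared cache lazily, so mutations
        # made by other requests change what this request ultimately sees.
        return (GLOBAL_HOST_CACHE[u] for u in seen if u in GLOBAL_HOST_CACHE)


    # Request A: placement says both nodes are candidates.
    gen_a = get_host_states_for_request([{'uuid': 'n1'}, {'uuid': 'n2'}])
    # Request B arrives before A consumes its generator; placement returned
    # no candidates for B, so B's pruning empties the shared cache.
    gen_b = get_host_states_for_request([])
    print(list(gen_a))  # [] -> request A fails with "no valid hosts"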
This removes the global host state cache from the scheduler HostManager
and instead generates a fresh host state map per request and uses that
to return hosts from the generator. Because we're filtering the
ComputeNodeList based on a placement query per request, each request
can have a completely different set of compute nodes that can fulfill
it, so we're not gaining much by caching host states anyway.
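For contrast, a minimal sketch of the per-request approach, using the
same hypothetical names as above: each call builds its own host state
map, so concurrent requests cannot invalidate each other's candidates:

    def get_host_states_per_request(nodes_from_placement):
        """Build a fresh host state map scoped to this request only."""
        host_state_map = {node['uuid']: node for node in nodes_from_placement}
        # No shared cache to prune; the generator reads only this map.
        return (host_state_map[u] for u in host_state_map)


    gen_a = get_host_states_per_request([{'uuid': 'n1'}, {'uuid': 'n2'}])
    gen_b = get_host_states_per_request([])  # B sees no candidates...
    print([h['uuid'] for h in gen_a])        # ...but A still sees ['n1', 'n2']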
Co-Authored-By: Dan Smith <dansmith@xxxxxxxxxx>
Closes-Bug: #1742827
Related-Bug: #1739323
Change-Id: I40c17ed88f50ecbdedc4daf368fff10e90e7be11
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1742827
Title:
nova-scheduler reports dead compute nodes but nova-compute is enabled
and up
Status in OpenStack Compute (nova):
Fix Released
Bug description:
(originally reported by David Manchado in
https://bugzilla.redhat.com/show_bug.cgi?id=1533196 )
Description of problem:
We are seeing that the nova scheduler is removing compute nodes because it considers them dead, but "openstack compute service list" reports nova-compute as up and running.
We can see in nova-scheduler entries with the following pattern:
- Removing dead compute node XXX from scheduler
- Filter ComputeFilter returned 0 hosts
- Filtering removed all hosts for the request with instance ID '11feeba9-f46c-416d-a97e-7c0c9d565b5a'. Filter results: ['AggregateInstanceExtraSpecsFilter: (start: 19, end: 2)', 'AggregateCoreFilter: (start: 2, end: 2)', 'AggregateDiskFilter: (start: 2, end: 2)', 'AggregateRamFilter: (start: 2, end: 2)', 'RetryFilter: (start: 2, end: 2)', 'AvailabilityZoneFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 0)']
Version-Release number of selected component (if applicable):
Ocata
How reproducible:
N/A
Actual results:
Instances are not being spawned and report 'no valid host found' because of the dead compute node removals described above.
Additional info:
This has been happening for a week.
We did an upgrade from Newton three weeks ago.
We have also done a minor update and the issue still persists.
Nova-related RPMs:
openstack-nova-scheduler-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-novncproxy-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-cert-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-console-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-conductor-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-common-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-compute-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
openstack-nova-placement-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
puppet-nova-10.4.2-0.20180102233330.f4bc1f0.el7.centos.noarch
openstack-nova-api-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
python-nova-15.1.1-0.20180103153502.ff2231f.el7.centos.noarch
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1742827/+subscriptions