yahoo-eng-team team mailing list archive

[Bug 1767139] Re: TypeError in _get_inventory_and_update_provider_generation

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1767139@xxxxxxxxxxxxxxxxxx>
Date: Thu, 03 May 2018 22:52:47 -0000
Reply-to: Bug 1767139 <1767139@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Reviewed:  https://review.openstack.org/566096
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=80a001989351d3d427c204c8c06cfacc964f2a35
Submitter: Zuul
Branch:    master

commit 80a001989351d3d427c204c8c06cfacc964f2a35
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date:   Thu May 3 11:21:47 2018 -0400

    Handle @safe_connect returns None side effect in _ensure_resource_provider
    
    Change I0c4ca6a81f213277fe7219cb905a805712f81e36 added more error
    handling to the _ensure_resource_provider flow but didn't account
    for @safe_connect returning None when calling _create_resource_provider
    in the case that nova-compute is started before placement is running.
    If that happens, we fail with a TypeError during the nova-compute
    startup because we put None in the resource provider cache and then
    later blindly try to use it because the compute node resource provider
    uuid is in the cache, but mapped to None.
    
    This adds the None check back in _ensure_resource_provider and if
    None is returned from _create_resource_provider we raise the same
    exception that _create_resource_provider would raise if it couldn't
    create the provider.
    
    Change-Id: If9e1581db9c1ae14340b787d03c815d243d5a50c
    Closes-Bug: #1767139
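
The pattern the commit message describes can be sketched roughly as follows. This is a simplified illustration, not nova's actual code: the `safe_connect` body, the `ReportClient` class, the `placement_up` flag, and both exception classes are hypothetical stand-ins. Only the method names `_create_resource_provider` and `_ensure_resource_provider` come from the message above.

```python
import functools


class PlacementUnavailable(Exception):
    """Hypothetical stand-in for a failed connection to placement."""


class ProviderCreationFailed(Exception):
    """Hypothetical stand-in for the 'could not create provider' error."""


def safe_connect(func):
    """Simplified stand-in: swallow connection errors and return None."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except PlacementUnavailable:
            return None  # the side effect callers must account for
    return wrapper


class ReportClient(object):
    def __init__(self, placement_up):
        self.placement_up = placement_up  # assumption: toggles the outage
        self.cache = {}

    @safe_connect
    def _create_resource_provider(self, uuid, name):
        if not self.placement_up:
            raise PlacementUnavailable()
        return {"uuid": uuid, "name": name, "generation": 0}

    def _ensure_resource_provider(self, uuid, name):
        if uuid in self.cache:
            return self.cache[uuid]
        provider = self._create_resource_provider(uuid, name)
        # The None check: never cache a provider that was not created.
        if provider is None:
            raise ProviderCreationFailed(name)
        self.cache[uuid] = provider
        return provider
```

In this sketch, a failed create raises immediately and leaves the cache untouched, so a later call can retry once placement is reachable.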


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1767139

Title:
  TypeError in _get_inventory_and_update_provider_generation

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  In Progress
Status in OpenStack Compute (nova) queens series:
  In Progress

Bug description:
  Description
  ===========

  While bringing up a new cluster as part of our CI after switching from
  16.1.0 to 16.1.1 on CentOS, I'm seeing this error on some compute nodes:

  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager Traceback (most recent call last):
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6752, in update_available_resource_for_node
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     rt.update_available_resource(context, nodename)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 704, in update_available_resource
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     self._update_available_resource(context, resources)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     return f(*args, **kwargs)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 728, in _update_available_resource
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     self._init_compute_node(context, resources)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 585, in _init_compute_node
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     self._update(context, cn)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 886, in _update
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     inv_data,
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 64, in set_inventory_for_provider
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     inv_data,
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     return getattr(self.instance, __name)(*args, **kwargs)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 789, in set_inventory_for_provider
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     self._update_inventory(rp_uuid, inv_data)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 56, in wrapper
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     return f(self, *a, **k)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 675, in _update_inventory
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     if self._update_inventory_attempt(rp_uuid, inv_data):
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 562, in _update_inventory_attempt
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     curr = self._get_inventory_and_update_provider_generation(rp_uuid)
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/scheduler/client/report.py", line 546, in _get_inventory_and_update_provider_generation
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager     if server_gen != my_rp['generation']:
  2018-04-26 13:36:26.580 14536 ERROR nova.compute.manager TypeError: 'NoneType' object has no attribute '__getitem__'

  The error seems persistent for a single run of nova-compute.
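
One plausible reading of the persistence, consistent with the commit message above, is that the provider uuid stays in the cache mapped to None, so every later lookup trusts the key and then subscripts None. The sketch below illustrates that reading only. The cache layout, the key, and both helper functions are hypothetical, not taken from nova.

```python
# Hypothetical cache state after a failed create: key present, value None.
provider_cache = {"compute-node-uuid": None}


def unsafe_generation(cache, rp_uuid):
    """Mimic the failing pattern: trust key membership, then subscript."""
    if rp_uuid in cache:
        return cache[rp_uuid]["generation"]  # TypeError when value is None
    return None


def cached_generation(cache, rp_uuid):
    """Return the cached generation, or None when no usable entry exists."""
    provider = cache.get(rp_uuid)
    if provider is None:  # covers both a missing key and a cached None
        return None
    return provider["generation"]
```

Because nothing in this sketch ever replaces the None entry, `unsafe_generation` fails on every call until the process (and its cache) is restarted.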

  Steps to reproduce
  ==================

  Nodes were started by our CI infrastructure.  We start 3 computes and
  a single control node.  In 50% of cases, one of the computes comes up
  in this bad state.

  Expected result
  ===============

  Working cluster.

  Actual result
  =============

  At least one of the 3 nodes fails to join the cluster: it is not picked
  up by discover_hosts, and the above stack trace is repeated in the
  nova-compute logs.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/

  $ rpm -qa | grep nova
  python-nova-16.1.1-1.el7.noarch
  openstack-nova-common-16.1.1-1.el7.noarch
  python2-novaclient-9.1.1-1.el7.noarch
  openstack-nova-api-16.1.1-1.el7.noarch
  openstack-nova-compute-16.1.1-1.el7.noarch

  
  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

  $ rpm -qa | grep kvm
  libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64
  qemu-kvm-common-ev-2.9.0-16.el7_4.14.1.x86_64
  qemu-kvm-ev-2.9.0-16.el7_4.14.1.x86_64

  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

  Not sure

  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

  Neutron with Calico (I work on Calico, this is our CI system)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1767139/+subscriptions
References

[Bug 1767139] [NEW] TypeError in _get_inventory_and_update_provider_generation
From: Shaun Crampton, 2018-04-26