← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1843050] Re: Regression in the ironic driver: wait_for_active checks a random node

 

Reviewed:  https://review.opendev.org/680684
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6bf04673032da52361a24ca7744c507b91318aec
Submitter: Zuul
Branch:    master

commit 6bf04673032da52361a24ca7744c507b91318aec
Author: Dmitry Tantsur <dtantsur@xxxxxxxxxxxxxx>
Date:   Fri Sep 6 15:16:40 2019 +0200

    Fix incorrect invocation of openstacksdk's baremetal.nodes()
    
    This call accepts keyword arguments, not a dictionary. The dictionary
    ends up passed into the 'details' argument, making it True, so the
    query returns all nodes instead of just one. Fix it and raise an error
    in case it ever repeats.
    
    Closes-Bug: #1843050
    Change-Id: If464423b3c5bc05e2b22ff2245828c677b264d15


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843050

Title:
  Regression in the ironic driver: wait_for_active checks a random node

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Due to a mistake in the SDK patches, the wait_for_active loop tries to
  check a random node instead of the target node. While it works fine in
  the default job with only one node at all, it fails in the multinode
  grenade.

  The failure [1] looks a bit cryptic:

  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall [-] Fixed interval looping call 'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall Traceback (most recent call last):
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall   File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall     result = func(*self.args, **self.kw)
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall   File "/opt/stack/nova/nova/virt/ironic/driver.py", line 542, in _wait_for_active
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall     raise exception.InstanceNotFound(instance_id=instance.uuid)
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
  Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall
  Sep 05 12:08:12.288474 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR nova.virt.ironic.driver [None req-fcb58bd7-803c-46ce-bc7e-3b0cfbac2a77 tempest-BaremetalBasicOps-911057241 tempest-BaremetalBasicOps-911057241] Error deploying instance 09957ef9-d006-4328-8342-03a712888996 on baremetal node 3040f681-72ff-4bcc-b0f6-129cfd146cf1.: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.

  This code path is usually hit when a node suddenly goes into
  AVAILABLE, which clearly shouldn't be the case here. The mistery got
  solved by looking at ironic-api logs [2]:

  [Thu Sep  5 12:08:11 2019] GET /baremetal/v1/nodes/detail => generated
  41705 bytes in 350 msecs (HTTP/1.1 200) 7 headers in 286 bytes (1
  switches on core 0)

  This does not match the expected query: GET
  /baremetal/v1/nodes/?instance_uuid=<uuid>,fields=<blah-blah>. After
  looking at the code more carefully, I've noticed that we're passing
  instance_id and fields inside a dict as a positional argument. In
  reality, they are expected as normal keyword arguments. The dict ends
  up in 'details', which ends up being True. So in reality we're listing
  all nodes and picking the first one.

  [1] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-n-cpu.txt.gz
  [2] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-ir-api.txt.gz

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1843050/+subscriptions


References