yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #79872
[Bug 1843050] Re: Regression in the ironic driver: wait_for_active checks a random node
Reviewed: https://review.opendev.org/680684
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6bf04673032da52361a24ca7744c507b91318aec
Submitter: Zuul
Branch: master
commit 6bf04673032da52361a24ca7744c507b91318aec
Author: Dmitry Tantsur <dtantsur@xxxxxxxxxxxxxx>
Date: Fri Sep 6 15:16:40 2019 +0200
Fix incorrect invocation of openstacksdk's baremetal.nodes()
This call accepts keyword arguments, not a dictionary. The dictionary
ends up passed into the 'details' argument, making it True, so the
query returns all nodes instead of just one. Fix it and raise an error
in case it ever repeats.
Closes-Bug: #1843050
Change-Id: If464423b3c5bc05e2b22ff2245828c677b264d15
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843050
Title:
Regression in the ironic driver: wait_for_active checks a random node
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Due to a mistake in the SDK patches, the wait_for_active loop tries to
check a random node instead of the target node. While it works fine in
the default job with only one node at all, it fails in the multinode
grenade.
The failure [1] looks a bit cryptic:
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall [-] Fixed interval looping call 'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall Traceback (most recent call last):
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall File "/opt/stack/nova/nova/virt/ironic/driver.py", line 542, in _wait_for_active
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall raise exception.InstanceNotFound(instance_id=instance.uuid)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall
Sep 05 12:08:12.288474 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR nova.virt.ironic.driver [None req-fcb58bd7-803c-46ce-bc7e-3b0cfbac2a77 tempest-BaremetalBasicOps-911057241 tempest-BaremetalBasicOps-911057241] Error deploying instance 09957ef9-d006-4328-8342-03a712888996 on baremetal node 3040f681-72ff-4bcc-b0f6-129cfd146cf1.: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
This code path is usually hit when a node suddenly goes into
AVAILABLE, which clearly shouldn't be the case here. The mistery got
solved by looking at ironic-api logs [2]:
[Thu Sep 5 12:08:11 2019] GET /baremetal/v1/nodes/detail => generated
41705 bytes in 350 msecs (HTTP/1.1 200) 7 headers in 286 bytes (1
switches on core 0)
This does not match the expected query: GET
/baremetal/v1/nodes/?instance_uuid=<uuid>,fields=<blah-blah>. After
looking at the code more carefully, I've noticed that we're passing
instance_id and fields inside a dict as a positional argument. In
reality, they are expected as normal keyword arguments. The dict ends
up in 'details', which ends up being True. So in reality we're listing
all nodes and picking the first one.
[1] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-n-cpu.txt.gz
[2] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-ir-api.txt.gz
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1843050/+subscriptions
References