yahoo-eng-team mailing list archive: Message #79859
[Bug 1843050] [NEW] Regression in the ironic driver: wait_for_active checks a random node
Public bug reported:
Due to a mistake in the SDK patches, the wait_for_active loop tries to
check a random node instead of the target node. While it works fine in
the default job, which has only a single node, it fails in the multinode
grenade job.
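For context, here is a rough, self-contained sketch (not the actual nova
driver code) of how such a wait loop is wired with oslo.service's
FixedIntervalLoopingCall. It shows why the traceback below goes through
loopingcall's _run_loop and why any unexpected exception raised by the
polling callback aborts the deployment:

    # Rough sketch only: a fixed-interval poll in the style of
    # IronicDriver._wait_for_active.  The callback is invoked repeatedly
    # until it raises LoopingCallDone (success); any other exception,
    # such as InstanceNotFound, propagates and fails the deploy.
    from oslo_service import loopingcall

    polls = {'count': 0}

    def _wait_for_active():
        polls['count'] += 1
        if polls['count'] >= 3:  # pretend the node reached ACTIVE
            raise loopingcall.LoopingCallDone()

    timer = loopingcall.FixedIntervalLoopingCall(_wait_for_active)
    timer.start(interval=0.1).wait()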
The failure [1] looks a bit cryptic:
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall [-] Fixed interval looping call 'nova.virt.ironic.driver.IronicDriver._wait_for_active' failed: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall Traceback (most recent call last):
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 150, in _run_loop
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall File "/opt/stack/nova/nova/virt/ironic/driver.py", line 542, in _wait_for_active
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall raise exception.InstanceNotFound(instance_id=instance.uuid)
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
Sep 05 12:08:12.286181 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR oslo.service.loopingcall
Sep 05 12:08:12.288474 ubuntu-bionic-rax-iad-0010814654 nova-compute[19863]: ERROR nova.virt.ironic.driver [None req-fcb58bd7-803c-46ce-bc7e-3b0cfbac2a77 tempest-BaremetalBasicOps-911057241 tempest-BaremetalBasicOps-911057241] Error deploying instance 09957ef9-d006-4328-8342-03a712888996 on baremetal node 3040f681-72ff-4bcc-b0f6-129cfd146cf1.: InstanceNotFound: Instance 09957ef9-d006-4328-8342-03a712888996 could not be found.
This code path is usually hit when a node suddenly goes into AVAILABLE,
which clearly shouldn't be the case here. The mystery was solved by
looking at the ironic-api logs [2]:
[Thu Sep 5 12:08:11 2019] GET /baremetal/v1/nodes/detail => generated 41705 bytes in 350 msecs (HTTP/1.1 200) 7 headers in 286 bytes (1 switches on core 0)
This does not match the expected query: GET
/baremetal/v1/nodes/?instance_uuid=<uuid>&fields=<blah-blah>. After
looking at the code more carefully, I noticed that we're passing
instance_id and fields inside a dict as a positional argument, while the
SDK expects them as regular keyword arguments. The dict therefore ends
up bound to 'details', and since a non-empty dict is truthy, 'details'
evaluates to True. So in reality we're listing all nodes in detail and
picking the first one.
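To make the mistake concrete, here is a minimal sketch of the buggy call
pattern versus the intended one. It assumes the SDK's baremetal proxy
signature nodes(details=False, **query); the connection setup, the
hard-coded UUID and the field list are placeholders rather than the exact
nova code:

    import openstack

    conn = openstack.connect()  # placeholder: any configured cloud

    instance_uuid = '09957ef9-d006-4328-8342-03a712888996'
    fields = ['uuid', 'provision_state', 'last_error', 'instance_uuid']

    # Buggy pattern: the dict binds to the positional 'details' parameter.
    # A non-empty dict is truthy, so the SDK issues
    # GET /baremetal/v1/nodes/detail and yields every node in the cloud.
    all_nodes = list(conn.baremetal.nodes({'instance_id': instance_uuid,
                                           'fields': fields}))

    # Intended pattern: pass the filters as keyword arguments, producing
    # GET /baremetal/v1/nodes?instance_uuid=<uuid>&fields=... and yielding
    # only the node associated with this instance.
    target_nodes = list(conn.baremetal.nodes(instance_id=instance_uuid,
                                             fields=fields))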
[1] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-n-cpu.txt.gz
[2] https://b817a98b71efe3dffdbb-b81d4381f8c07328b28c126876a3e27b.ssl.cf2.rackcdn.com/679949/1/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/d22d8d9/controller/logs/screen-ir-api.txt.gz
** Affects: nova
Importance: Critical
Assignee: Dmitry Tantsur (divius)
Status: In Progress
** Tags: ironic
** Changed in: nova
Assignee: (unassigned) => Dmitry Tantsur (divius)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1843050
Title:
Regression in the ironic driver: wait_for_active checks a random node
Status in OpenStack Compute (nova):
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1843050/+subscriptions