yahoo-eng-team mailing list archive: Message #66173
[Bug 1707003] [NEW] gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate
Public bug reported:
Looking at the Neutron Failure Rate dashboard, specifically the tempest jobs:
http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
One can see the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate, over 90% for the past 5 days.
Matt Riedemann did an analysis (which I'll paste below); the summary is that setup of the 3-node job is failing frequently because the third node is not being discovered, leading to a failure when Tempest later tries to use that node.
So the first step is to change the devstack-gate (?) code to wait for all the subnodes to show up from a Nova perspective before proceeding. There was a previous attempt at a grenade change, https://review.openstack.org/#/c/426310/, that was abandoned, but based on the analysis it seems like a good starting point.
Matt's comment #1:
Looking at the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job failure, subnode-2 and subnode-3 both look OK as far as their config goes. They use the same values for nova-cpu.conf, pointing at the nova_cell1 MQ, which points at the cell1 conductor and cell1 database. I see that the compute node records for both subnode-2 and subnode-3 are created *after* discover_hosts runs:
2017-07-25 15:06:55.991684 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L777: discover_hosts
Jul 25 15:06:58.945371 ubuntu-xenial-3-node-rax-iad-10067333-744503 nova-compute[794]: INFO nova.compute.resource_tracker [None req-f69c76bf-0263-494b-8257-61617c90d799 None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744503:ubuntu-xenial-3-node-rax-iad-10067333-744503 with uuid: 1788fe0b-496c-4eda-b03a-2cf4a2733a94
Jul 25 15:07:02.323379 ubuntu-xenial-3-node-rax-iad-10067333-744504 nova-compute[827]: INFO nova.compute.resource_tracker [None req-95419fec-a2a7-467f-b167-d83755273a7a None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744504:ubuntu-xenial-3-node-rax-iad-10067333-744504 with uuid: ae3420a1-20d2-42a1-909d-fc9cf1b14248
And looking at the discover_hosts output, only subnode-2 is discovered as the unmapped host:
http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstack-gate-discover-hosts.txt.gz
The compute node from the primary host is discovered and mapped to cell1 as part of the devstack run on the primary host:
http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstacklog.txt.gz#_2017-07-25_14_50_45_831
So it seems we are simply getting lucky: the compute node from subnode-2 is discovered and mapped to cell1, but the compute node from subnode-3 is missed, so it never gets mapped and things fail when Tempest tries to use it. This could be a problem on any 3-node job, and might not just be related to this devstack change.
Matt's comment #2:
I've gone through the dvr-ha 3-node job failure and it appears to be a latent issue that we could also hit in 2-node jobs. In fact, I noticed in a 2-node job that the subnode's compute node record is created after we start running discover_hosts from the primary via devstack-gate. So this is a race window we already have; 3-node jobs may just expose it more often if they are slower or if they slow down traffic on the control node.
If you look at the cells v2 setup guide, it even says to make sure the
computes are created before running discover_hosts:
https://docs.openstack.org/nova/latest/user/cells.html
"Configure and start your compute hosts. Before step 7, make sure you
have compute hosts in the database by running nova service-list --binary
nova-compute."
Step 7 is running 'nova-manage cell_v2 discover_hosts'.
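In practice the ordering the guide describes amounts to something like this on the controller (a generic sketch of the manual steps, not devstack's actual code):

  # Make sure every expected nova-compute service has registered:
  nova service-list --binary nova-compute
  # Only then map the newly registered compute hosts into their cell:
  nova-manage cell_v2 discover_hosts --verbose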
Ideally, devstack-gate should pass a variable to the discover_hosts.sh script in devstack telling it how many compute services we expect (3 in the case of the dvr-ha job). That script should then run 'nova service-list --binary nova-compute' and count the results until the expected number is reached or it times out, and then run discover_hosts. That's really what we expect from other deployment tools like TripleO and Kolla.
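A rough sketch of what that wait-then-discover step could look like (EXPECTED_COMPUTES and the timeout value are hypothetical names chosen for illustration, not existing devstack-gate settings):

  # Wait until the expected number of nova-compute services have registered.
  EXPECTED_COMPUTES=${EXPECTED_COMPUTES:-3}
  timeout=120
  start=$(date +%s)
  while true; do
      # Count the registered nova-compute services.
      count=$(nova service-list --binary nova-compute 2>/dev/null | grep -c nova-compute)
      [ "$count" -ge "$EXPECTED_COMPUTES" ] && break
      if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
          echo "Timed out waiting for $EXPECTED_COMPUTES nova-compute services (found $count)"
          break
      fi
      sleep 5
  done
  # Every compute host that has registered by now should get mapped to its cell.
  nova-manage cell_v2 discover_hosts --verbose

devstack-gate would then just export the expected count per job layout (e.g. 3 for this dvr-ha job).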
But overall I'm not finding anything in this change that's killing these
jobs outright, so let's get it in.
Matt's comment #3:
This is what I see for voting jobs that fail with the 'host is not
mapped to any cell' error:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Host%5C%22%20AND%20message%3A%5C%22is%20not%20mapped%20to%20any%20cell%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1%20AND%20build_status%3A%5C%22FAILURE%5C%22&from=7d
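URL-decoded (minus the escaping), that query is:

  message:"Host" AND message:"is not mapped to any cell" AND tags:"console" AND voting:1 AND build_status:"FAILURE"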
Those are all grenade multinode jobs.
Likely https://review.openstack.org/#/c/426310/, or a variant thereof,
would resolve it.
** Affects: neutron
Importance: High
Assignee: Brian Haley (brian-haley)
Status: Confirmed
** Tags: l3-dvr-backlog
https://bugs.launchpad.net/bugs/1707003