yahoo-eng-team mailing list archive: Message #66173
[Bug 1707003] [NEW] gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate
Public bug reported:
Looking at the Neutron Failure Rate dashboard, specifically the tempest jobs:
http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen
One can see the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate, over 90% for the past 5 days.
Matt Riedemann did an analysis (which I'll paste below); the summary is that setup of the 3-node job is failing frequently because the third node is not being discovered, leading to a failure when Tempest later tries to use that node.
So the first step is to change the devstack-gate (?) code to wait for all the subnodes to show up from a Nova perspective before proceeding. There was a previous attempt at a grenade change, https://review.openstack.org/#/c/426310/, that was abandoned, but based on the analysis it seems like a good starting point.
Matt's comment #1:
Looking at the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job failure, subnode-2 and subnode-3 both look OK as far as their config goes. They use the same values for nova-cpu.conf, pointing at the nova_cell1 MQ, which points at the cell1 conductor and cell1 database. I see that the compute node records for both subnode-2 and subnode-3 are created *after* discover_hosts runs:
2017-07-25 15:06:55.991684 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L777: discover_hosts
Jul 25 15:06:58.945371 ubuntu-xenial-3-node-rax-iad-10067333-744503 nova-compute[794]: INFO nova.compute.resource_tracker [None req-f69c76bf-0263-494b-8257-61617c90d799 None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744503:ubuntu-xenial-3-node-rax-iad-10067333-744503 with uuid: 1788fe0b-496c-4eda-b03a-2cf4a2733a94
Jul 25 15:07:02.323379 ubuntu-xenial-3-node-rax-iad-10067333-744504 nova-compute[827]: INFO nova.compute.resource_tracker [None req-95419fec-a2a7-467f-b167-d83755273a7a None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744504:ubuntu-xenial-3-node-rax-iad-10067333-744504 with uuid: ae3420a1-20d2-42a1-909d-fc9cf1b14248
And looking at the discover_hosts output, only subnode-2 is discovered as the unmapped host:
http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstack-gate-discover-hosts.txt.gz
The compute node from the primary host is discovered and mapped to cell1 as part of the devstack run on the primary host:
http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstacklog.txt.gz#_2017-07-25_14_50_45_831
So it seems we are simply getting lucky: the compute node from subnode-2 is discovered and mapped to cell1, but the compute node from subnode-3 is missed, so it never gets mapped and things fail when Tempest tries to use it. This could be a problem on any 3-node job, and might not just be related to this devstack change.
Matt's comment #2:
I've gone through the dvr-ha 3-node job failure and it appears to be a latent issue that we could also hit in 2-node jobs. In fact, I noticed in a 2-node job that the subnode's compute node record is created after we start running discover_hosts from the primary via devstack-gate. So this is a race window we already have; 3-node jobs may just expose it more often if they are slower or if they slow down traffic on the control node.
If you look at the cells v2 setup guide, it even says to make sure the
computes are created before running discover_hosts:
https://docs.openstack.org/nova/latest/user/cells.html
"Configure and start your compute hosts. Before step 7, make sure you
have compute hosts in the database by running nova service-list --binary
nova-compute."
Step 7 is running 'nova-manage cell_v2 discover_hosts'.
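In practice the ordering the guide describes amounts to something like this on the controller (a generic sketch of the manual steps, not devstack's actual code):

  # Make sure every expected nova-compute service has registered:
  nova service-list --binary nova-compute
  # Only then map the newly registered compute hosts into their cell:
  nova-manage cell_v2 discover_hosts --verbose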
Ideally, devstack-gate should pass a variable to the discover_hosts.sh script in devstack telling it how many compute services we expect (3 in the case of the dvr-ha job). That script should then run 'nova service-list --binary nova-compute' and count the results until the expected number is reached or it times out, and then run discover_hosts. That's really what we expect from other deployment tools like TripleO and Kolla.
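A rough sketch of what that wait-then-discover step could look like (EXPECTED_COMPUTES and the timeout value are hypothetical names chosen for illustration, not existing devstack-gate settings):

  # Wait until the expected number of nova-compute services have registered.
  EXPECTED_COMPUTES=${EXPECTED_COMPUTES:-3}
  timeout=120
  start=$(date +%s)
  while true; do
      # Count the registered nova-compute services.
      count=$(nova service-list --binary nova-compute 2>/dev/null | grep -c nova-compute)
      [ "$count" -ge "$EXPECTED_COMPUTES" ] && break
      if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
          echo "Timed out waiting for $EXPECTED_COMPUTES nova-compute services (found $count)"
          break
      fi
      sleep 5
  done
  # Every compute host that has registered by now should get mapped to its cell.
  nova-manage cell_v2 discover_hosts --verbose

devstack-gate would then just export the expected count per job layout (e.g. 3 for this dvr-ha job).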
But overall I'm not finding anything in this change that's killing these
jobs outright, so let's get it in.
Matt's comment #3:
This is what I see for voting jobs that fail with the 'host is not
mapped to any cell' error:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Host%5C%22%20AND%20message%3A%5C%22is%20not%20mapped%20to%20any%20cell%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1%20AND%20build_status%3A%5C%22FAILURE%5C%22&from=7d
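URL-decoded (minus the escaping), that query is:

  message:"Host" AND message:"is not mapped to any cell" AND tags:"console" AND voting:1 AND build_status:"FAILURE"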
Those are all grenade multinode jobs.
Likely https://review.openstack.org/#/c/426310/, or a variant thereof,
would resolve it.
** Affects: neutron
Importance: High
Assignee: Brian Haley (brian-haley)
Status: Confirmed
** Tags: l3-dvr-backlog
https://bugs.launchpad.net/bugs/1707003