
yahoo-eng-team team mailing list archive

[Bug 1806925] [NEW] evacuate test fails due to timeout waiting for evacuate to complete

 

Public bug reported:

In the post-test hook in the nova-live-migration job where we test
evacuate, we're doing the following:

1. create an image-backed and volume-backed server on the subnode
2. stop libvirtd on the local node
3. run evacuate to see it fail because nova-compute is disabled on the local node
4. restart libvirtd, wait for the local nova-compute service to be enabled, and then evacuate each server

In this failure, the evacuate times out because libvirtd is still
unavailable on the local node after we started the evacuate:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/devstack-gate-post_test_hook.txt.gz#_2018-12-05_10_05_50_130

2018-12-05 10:05:50.130 | + /opt/stack/new/nova/gate/test_evacuate.sh:evacuate_and_wait_for_active:114 :   nova evacuate evacuate-test

nova-compute on the local host is back up here:

Dec 05 10:05:49.341595 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO nova.virt.libvirt.driver [None req-e14feea2-2abc-43cc-b51f-f416f9dd5692 None None] Connection event '1' reason 'None'

The evacuate starts here:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/screen-n-cpu.txt.gz#_Dec_05_10_05_54_156579

Dec 05 10:05:54.156579 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO nova.compute.manager [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] [instance: 19ef59e3-de5a-42b2-b0aa-d069702deedf] Evacuating instance

After that I don't see any failures, but the evacuation doesn't complete
within the 30-second timeout - maybe the timeout isn't long enough?

It looks like when we time out, we're still waiting for the
network-vif-plugged event from neutron:

http://logs.openstack.org/54/620154/1/gate/nova-live-migration/f040b76/logs/screen-n-cpu.txt.gz#_Dec_05_10_06_04_554322

Dec 05 10:06:04.554322 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: DEBUG nova.compute.manager [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] [instance: 19ef59e3-de5a-42b2-b0aa-d069702deedf] Preparing to wait for external event network-vif-plugged-7d5ba599-9c7a-4e41-9fe4-3aff44a75458 {{(pid=16115) prepare_for_instance_event /opt/stack/new/nova/nova/compute/manager.py:327}}

The VIF is plugged here:

Dec 05 10:06:04.620986 ubuntu-xenial-ovh-bhs1-0000944602 nova-compute[16115]: INFO os_vif [None req-c2f2a1d3-527f-4885-8e4f-e82003a6d472 demo admin] Successfully plugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:e5:b1:9f,bridge_name='br-int',has_traffic_filtering=True,id=7d5ba599-9c7a-4e41-9fe4-3aff44a75458,network=Network(22273876-0d80-4450-8913-0102f3f79ccf),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap7d5ba599-9c')
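
To make the sequence easier to follow, here is a small sketch that lines up
the timestamps quoted above and prints how long after the "nova evacuate"
command each one was logged. The timestamps are copied from the log excerpts
in this report; the event labels and the helper name elapsed_since_first are
made up for illustration, and the sketch assumes the post_test_hook log and
the n-cpu log share the same clock.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this report (all on 2018-12-05).
# The labels are informal descriptions, not strings taken from the logs.
EVENTS = [
    ("nova evacuate issued by test_evacuate.sh", "10:05:50.130"),
    ("nova-compute logs 'Evacuating instance'", "10:05:54.156579"),
    ("nova-compute prepares to wait for network-vif-plugged", "10:06:04.554322"),
    ("os_vif logs 'Successfully plugged vif'", "10:06:04.620986"),
]


def elapsed_since_first(events):
    """Return (label, seconds since the first event) for each event."""
    parsed = [
        (label, datetime.strptime(stamp, "%H:%M:%S.%f"))
        for label, stamp in events
    ]
    start = parsed[0][1]
    return [(label, (when - start).total_seconds()) for label, when in parsed]


if __name__ == "__main__":
    for label, seconds in elapsed_since_first(EVENTS):
        print(f"{seconds:7.3f}s  {label}")
```

By these timestamps, nova-compute starts the evacuation roughly 4 seconds
after the command is issued, and the wait for network-vif-plugged only
begins roughly 14.4 seconds after it.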

We time out about a second or so after the VIF is plugged, but it usually
takes about 5 seconds to get the network-vif-plugged event back from
neutron, and this is a slower ovh node, so our timeout is likely just not
long enough. For comparison, tempest's compute build_timeout is 300 seconds:

https://github.com/openstack/tempest/blob/eac094a8cf834d035316a900107f601adcc42ff5/tempest/config.py#L288
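
As an illustration of the kind of wait loop being discussed, here is a
minimal Python sketch of polling for a server status with a configurable
timeout. It is not the code in test_evacuate.sh (which is a shell script);
the function name wait_for_status, its parameters, and the get_status
callable are all hypothetical, and the 300 second default only mirrors the
tempest value mentioned above rather than being a tested recommendation.

```python
import time


def wait_for_status(get_status, wanted="ACTIVE", timeout=300.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    get_status is a hypothetical callable supplied by the caller, for
    example a wrapper that looks up the server's current status.
    Returns True if the wanted status was seen, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_status() == wanted:
            return True
        if clock() >= deadline:
            # Give up only after a status check, so the status is always
            # checked at least once even with a zero timeout.
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated server that becomes ACTIVE on the third poll.
    statuses = iter(["REBUILD", "REBUILD", "ACTIVE"])
    print(wait_for_status(lambda: next(statuses), timeout=5, interval=0.1))
```

Using a monotonic clock for the deadline keeps the wait from being affected
by wall-clock adjustments on the test node, and passing the timeout in as a
parameter makes it easy to raise for slower nodes.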

** Affects: nova
     Importance: High
         Status: Triaged


** Tags: evacuate gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1806925

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1806925/+subscriptions

