← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1813789] [NEW] Evacuate test intermittently fails with network-vif-plugged timeout exception

 

Public bug reported:

The nova-live-migration job has 2 nodes and in addition to running live
migration tests it also runs evacuate tests for image-backed and volume-
backed servers.

Seeing intermittent failures, with debug in this change:

https://review.openstack.org/#/c/571325/

It looks like the network-vif-plugged event is coming to the API before
nova-compute has registered a callback for that event, so nova-compute
does not process the event and then times out waiting for the event to
complete by the time it registered.

The API processes the network-vif-plugged for that server here:

http://logs.openstack.org/25/571325/10/gate/nova-live-
migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004

Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393

Which is weird because that's before the vif plugging timeout in the
compute logs. From the compute logs, we plugged the vif here:

Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-
b4e0ab87601c demo admin] Successfully plugged vif
VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-
int',has_traffic_filtering=True,id=e241f79f-
fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name
='tape241f79f-fb')

So it looks like maybe we're getting the network-vif-plugged event
before we're ready for it and so we miss the event and then timeout?
That's pretty weird though because the libvirt driver's spawn method is
what should be registering and waiting for the vif plugged event:

https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673

Looks like the compute does register the event callback for network-vif-
plugged here:

Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-4ff3
-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-
ce175adbdc67] Preparing to wait for external event network-vif-plugged-
e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903)
prepare_for_instance_event
/opt/stack/new/nova/nova/compute/manager.py:325}}

Which is too late:

Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B
('network-vif-
plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

4 hits in the last 7 days, check and gate, all failures.

** Affects: nova
     Importance: Medium
         Status: Confirmed


** Tags: evacuate gate-failure networking

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1813789

Title:
  Evacuate test intermittently fails with network-vif-plugged timeout
  exception

Status in OpenStack Compute (nova):
  Confirmed

Bug description:
  The nova-live-migration job has 2 nodes and in addition to running
  live migration tests it also runs evacuate tests for image-backed and
  volume-backed servers.

  Seeing intermittent failures, with debug in this change:

  https://review.openstack.org/#/c/571325/

  It looks like the network-vif-plugged event is coming to the API
  before nova-compute has registered a callback for that event, so nova-
  compute does not process the event and then times out waiting for the
  event to complete by the time it registered.

  The API processes the network-vif-plugged for that server here:

  http://logs.openstack.org/25/571325/10/gate/nova-live-
  migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004

  Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
  devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
  3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
  83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
  method: <bound method ServerExternalEventsController.create of
  <nova.api.openstack.compute.server_external_events.ServerExternalEventsController
  object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
  "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
  plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
  {{(pid=22322) _process_stack
  /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

  Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
  Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
  Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393

  Which is weird because that's before the vif plugging timeout in the
  compute logs. From the compute logs, we plugged the vif here:

  Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-
  compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-
  b4e0ab87601c demo admin] Successfully plugged vif
  VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-
  int',has_traffic_filtering=True,id=e241f79f-
  fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name
  ='tape241f79f-fb')

  So it looks like maybe we're getting the network-vif-plugged event
  before we're ready for it and so we miss the event and then timeout?
  That's pretty weird though because the libvirt driver's spawn method
  is what should be registering and waiting for the vif plugged event:

  https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673

  Looks like the compute does register the event callback for network-
  vif-plugged here:

  Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-
  compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-
  4ff3-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-
  ce175adbdc67] Preparing to wait for external event network-vif-
  plugged-e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903)
  prepare_for_instance_event
  /opt/stack/new/nova/nova/compute/manager.py:325}}

  Which is too late:

  Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
  devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
  3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
  83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
  method: <bound method ServerExternalEventsController.create of
  <nova.api.openstack.compute.server_external_events.ServerExternalEventsController
  object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
  "tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
  plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
  {{(pid=22322) _process_stack
  /opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B
  ('network-vif-
  plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

  4 hits in the last 7 days, check and gate, all failures.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1813789/+subscriptions


Follow ups