yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #76789
[Bug 1813789] [NEW] Evacuate test intermittently fails with network-vif-plugged timeout exception
Public bug reported:
The nova-live-migration job has 2 nodes and in addition to running live
migration tests it also runs evacuate tests for image-backed and volume-
backed servers.
Seeing intermittent failures, with debug in this change:
https://review.openstack.org/#/c/571325/
It looks like the network-vif-plugged event is coming to the API before
nova-compute has registered a callback for that event, so nova-compute
does not process the event and then times out waiting for the event to
complete by the time it registered.
The API processes the network-vif-plugged for that server here:
http://logs.openstack.org/25/571325/10/gate/nova-live-
migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393
Which is weird because that's before the vif plugging timeout in the
compute logs. From the compute logs, we plugged the vif here:
Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-
b4e0ab87601c demo admin] Successfully plugged vif
VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-
int',has_traffic_filtering=True,id=e241f79f-
fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name
='tape241f79f-fb')
So it looks like maybe we're getting the network-vif-plugged event
before we're ready for it and so we miss the event and then timeout?
That's pretty weird though because the libvirt driver's spawn method is
what should be registering and waiting for the vif plugged event:
https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673
Looks like the compute does register the event callback for network-vif-
plugged here:
Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-4ff3
-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-
ce175adbdc67] Preparing to wait for external event network-vif-plugged-
e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903)
prepare_for_instance_event
/opt/stack/new/nova/nova/compute/manager.py:325}}
Which is too late:
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B
('network-vif-
plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d
4 hits in the last 7 days, check and gate, all failures.
** Affects: nova
Importance: Medium
Status: Confirmed
** Tags: evacuate gate-failure networking
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1813789
Title:
Evacuate test intermittently fails with network-vif-plugged timeout
exception
Status in OpenStack Compute (nova):
Confirmed
Bug description:
The nova-live-migration job has 2 nodes and in addition to running
live migration tests it also runs evacuate tests for image-backed and
volume-backed servers.
Seeing intermittent failures, with debug in this change:
https://review.openstack.org/#/c/571325/
It looks like the network-vif-plugged event is coming to the API
before nova-compute has registered a callback for that event, so nova-
compute does not process the event and then times out waiting for the
event to complete by the time it registered.
The API processes the network-vif-plugged for that server here:
http://logs.openstack.org/25/571325/10/gate/nova-live-
migration/52e1cd0/logs/screen-n-api.txt.gz#_Jan_29_01_11_49_707004
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
Jan 29 01:11:49.765687 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.compute.server_external_events [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Creating event network-vif-plugged:e241f79f-fb0d-4961-b0c8-aea9de2755bf for instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 on ubuntu-xenial-rax-ord-0002201758
Jan 29 01:11:49.776578 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: DEBUG nova.compute.api [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] Instance 2e82ddcd-75b8-4a41-8ecd-ce175adbdc67 is migrating, copying events to all relevant hosts: set([u'ubuntu-xenial-rax-ord-0002201758', u'ubuntu-xenial-rax-ord-0002201755']) {{(pid=22322) _get_relevant_hosts /opt/stack/new/nova/nova/compute/api.py:4818}}
Jan 29 01:11:49.786947 ubuntu-xenial-rax-ord-0002201755 devstack@n-api.service[22319]: INFO nova.api.openstack.requestlog [req-3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-83c6-44d5-afd4-b5435e71d61f service nova] 10.210.224.40 "POST /compute/v2.1/os-server-external-events" status: 200 len: 183 microversion: 2.1 time: 0.084393
Which is weird because that's before the vif plugging timeout in the
compute logs. From the compute logs, we plugged the vif here:
Jan 29 01:11:55.571807 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: INFO os_vif [None req-252304a1-6eff-4ff3-aa4d-
b4e0ab87601c demo admin] Successfully plugged vif
VIFOpenVSwitch(active=False,address=fa:16:3e:66:03:76,bridge_name='br-
int',has_traffic_filtering=True,id=e241f79f-
fb0d-4961-b0c8-aea9de2755bf,network=Network(2812d9e2-bfa8-4397-9a81-5dab9fe5a03e),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name
='tape241f79f-fb')
So it looks like maybe we're getting the network-vif-plugged event
before we're ready for it and so we miss the event and then timeout?
That's pretty weird though because the libvirt driver's spawn method
is what should be registering and waiting for the vif plugged event:
https://github.com/openstack/nova/blob/c134feda3d9527dbc9735e4ae9cd35c4782f1fb4/nova/virt/libvirt/driver.py#L5673
Looks like the compute does register the event callback for network-
vif-plugged here:
Jan 29 01:11:55.567189 ubuntu-xenial-rax-ord-0002201755 nova-
compute[15903]: DEBUG nova.compute.manager [None req-252304a1-6eff-
4ff3-aa4d-b4e0ab87601c demo admin] [instance: 2e82ddcd-75b8-4a41-8ecd-
ce175adbdc67] Preparing to wait for external event network-vif-
plugged-e241f79f-fb0d-4961-b0c8-aea9de2755bf {{(pid=15903)
prepare_for_instance_event
/opt/stack/new/nova/nova/compute/manager.py:325}}
Which is too late:
Jan 29 01:11:49.707004 ubuntu-xenial-rax-ord-0002201755
devstack@n-api.service[22319]: DEBUG nova.api.openstack.wsgi [req-
3ed0ada1-7328-4d5a-a3ea-da34dcdb252d req-1f5aeede-
83c6-44d5-afd4-b5435e71d61f service nova] Action: 'create', calling
method: <bound method ServerExternalEventsController.create of
<nova.api.openstack.compute.server_external_events.ServerExternalEventsController
object at 0x7fb0133aa050>>, body: {"events": [{"status": "completed",
"tag": "e241f79f-fb0d-4961-b0c8-aea9de2755bf", "name": "network-vif-
plugged", "server_uuid": "2e82ddcd-75b8-4a41-8ecd-ce175adbdc67"}]}
{{(pid=22322) _process_stack
/opt/stack/new/nova/nova/api/openstack/wsgi.py:520}}
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timeout%20waiting%20for%20%5B
('network-vif-
plugged'%5C%22%20AND%20message%3A%5C%22for%20instance%20with%20vm_state%20error%20and%20task_state%20rebuild_spawning.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d
4 hits in the last 7 days, check and gate, all failures.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1813789/+subscriptions
Follow ups