
yahoo-eng-team team mailing list archive

[Bug 1911474] [NEW] Nova evacuate instance fails as port tries to bind to dead host

 

Public bug reported:

Description
===========
Running "nova evacuate <instance ID> <compute>" works for the first instance but fails for the second. The logs show Nova telling Neutron to bind the port on the dead compute node. This has happened to us twice in production, but we have not been able to reproduce it outside production.

Steps to reproduce
==================
1. Have a dead compute node (a node that is simply shut down may be enough?)
2. Execute "nova evacuate EVACUATED_SERVER_NAME HOST_B"
3. Execute "nova evacuate EVACUATED_SERVER_NAME2 HOST_B"
4. Watch in horror as Neutron tries to bind the second instance's port to the dead host
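
For anyone triaging this, step 4 can be checked without reading logs by looking at where the port is bound after the evacuate. The following is only a minimal sketch: it assumes the port details were exported as JSON (for example with "openstack port show <port-id> -f json") and that the output carries a "binding_host_id" key. The function name and the host names in the usage note are hypothetical.

```python
import json


def port_bound_to(port_json, expected_host):
    """Return True if the port's binding host matches expected_host.

    port_json is assumed to be the JSON text describing a single port,
    with the bound host stored under the "binding_host_id" key.
    """
    port = json.loads(port_json)
    # A missing key is treated as "not bound to the expected host".
    return port.get("binding_host_id") == expected_host
```

With a saved JSON dump of the port, port_bound_to(dump, "HOST_B") should be True after a successful evacuate. If it is False and the same call with the dead host's name is True, that matches the failure described here.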


Expected result
===============
A working VM evacuated to a working compute host.

Actual result
=============
A failed evacuate, leaving the VM in error status on the dead compute host.


Environment
===========
CentOS 7.8.2003
OpenStack Rocky
openstack-nova-scheduler-18.3.0-1.el7.noarch
openstack-nova-api-18.3.0-1.el7.noarch
openstack-nova-common-18.3.0-1.el7.noarch
openstack-nova-novncproxy-18.3.0-1.el7.noarch
openstack-nova-placement-api-18.3.0-1.el7.noarch
python-nova-18.3.0-1.el7.noarch
openstack-nova-console-18.3.0-1.el7.noarch
python2-novaclient-11.0.1-1.el7.noarch
openstack-nova-conductor-18.3.0-1.el7.noarch

Libvirt + KVM
libvirt-4.5.0-33.el7_8.1.x86_64
qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64

Ceph 14.2.11

Neutron-Openvswitch

Logs & Configs
==============
*These logs are from production, which was not in debug mode when the issue happened. Because this is production, I cannot force a reproduction of the issue with debug enabled.*

f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
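
To make it easier to see which host each external event was created for, the "Creating event ..." lines above can be filtered per instance. This is a rough sketch, not a Nova tool: the regular expression is derived only from the two event lines shown in this report, and the function names are made up for illustration.

```python
import re

# Pattern based on the "Creating event <name>:<tag> for instance <uuid> on <host>"
# lines in the nova-api log excerpt; other Nova versions may format this differently.
EVENT_RE = re.compile(
    r"Creating event (?P<event>[\w-]+):(?P<tag>[0-9a-f-]{36}) "
    r"for instance (?P<instance>[0-9a-f-]{36}) on (?P<host>\S+)"
)


def external_events(lines, instance_id):
    """Yield (event, tag, host) for each external event logged for instance_id."""
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group("instance") == instance_id:
            yield match.group("event"), match.group("tag"), match.group("host")


def events_on_host(lines, instance_id, host):
    """Return only the events for instance_id that were created on the given host."""
    return [e for e in external_events(lines, instance_id) if e[2] == host]
```

Passing the log lines, the instance UUID, and the name of the dead compute node to events_on_host() lists any events that were still being directed at that node.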

*I was asked to use the event list to find further logs. However, the
event list did not return anything when I queried it. Could this be
because the problematic instance was deleted and recreated?*

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1911474

Title:
  Nova evacuate instance fails as port tries to bind to dead host

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1911474/+subscriptions

