yahoo-eng-team team mailing list archive

[Bug 1535918] Re: instance.host not updated on evacuation

 

I am able to reproduce this same issue on a multinode devstack running
libvirt.

On the source host, the last call to
nova/network/base_api.py::update_instance_cache_with_nw_info for a
specific instance before the source host crashes has the nw_info passed
in as a VIF object with the "active" attribute set to False. This is
because the VM has just been deployed and the network was just created.
In other words, the last time the instance's InstanceInfoCache's
network_info attribute was updated before the source host went down, the
VIF was considered not active. In some environments, especially when
doing concurrent deploys, it may take a while for the InstanceInfoCache
to update the network_info to show as active.

What this boils down to is that Nova's InstanceInfoCache can potentially
have a stale network_info active state. This causes the rebuild flow
(which is the same as the spawn flow) to potentially end up waiting for
the network-vif-plugged event, which will never come because it was sent
to the source host instead of the destination. As a result, the
rebuild fails because VIF plugging times out.
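
The failure mode described above can be illustrated with a minimal,
self-contained sketch. It is not Nova's code: the function names
(events_to_wait_for, route_event) and the dict-based VIF records are
hypothetical stand-ins, and the only assumptions taken from this report are
that the spawn flow waits on network-vif-plugged for any VIF the cache marks
inactive and that the event is delivered to whatever instance.host says.

```python
def events_to_wait_for(network_info):
    """Return (event_name, vif_id) pairs for VIFs the cache marks inactive.

    Hypothetical helper: a VIF cached with active=False is assumed to
    still need a network-vif-plugged event before spawn can proceed.
    """
    return [("network-vif-plugged", vif["id"])
            for vif in network_info
            if vif.get("active", True) is False]


def route_event(instance_host, running_hosts):
    """Return the host that receives the event, or None if it is down.

    Hypothetical helper: events are assumed to be sent to instance.host,
    which still names the source host while the evacuation is in progress.
    """
    return instance_host if instance_host in running_hosts else None


# Cache contents as last written on the source host before it went down.
stale_cache = [{"id": "vif-1", "active": False}]
# What the cache would hold had it been refreshed after the port went active.
fresh_cache = [{"id": "vif-1", "active": True}]

pending = events_to_wait_for(stale_cache)
receiver = route_event("hostA", running_hosts={"hostB"})

if pending and receiver != "hostB":
    print("hostB waits for %s, but no event will reach it" % (pending,))
```

With fresh_cache the list of pending events is empty, so the destination
would not wait at all; with stale_cache it waits on an event that is
addressed to the downed source host, which matches the timeout seen in the
trace further down.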

Steps:

1) Deploy VM(s) to host A
2) Take host A down (e.g., kill its nova api and nova compute processes) once the VM(s) from (1) have finished deploying
3) Try to evacuate VM(s) from host A to host B
4) Evacuation may time out for the reason explained above. It is much easier to reproduce if you do step (2) as soon as possible after the VM(s) finish deploying

stack@controller:~$ glance image-list
+--------------------------------------+---------------------------------+
| ID                                   | Name                            |
+--------------------------------------+---------------------------------+
| f91197db-16b5-44b2-beb4-72a9e57041c2 | cirros-0.3.4-x86_64-uec         |
| 1348de9b-501d-426c-8cb5-e65381208085 | cirros-0.3.4-x86_64-uec-kernel  |
| 790ebadb-bc5b-48be-b1f0-95a9214a11ae | cirros-0.3.4-x86_64-uec-ramdisk |
+--------------------------------------+---------------------------------+
stack@controller:~$
stack@controller:~$ neutron net-list
+--------------------------------------+---------+----------------------------------------------------------+
| id                                   | name    | subnets                                                  |
+--------------------------------------+---------+----------------------------------------------------------+
| 4ba74a3e-e7a8-4ca4-9de5-8a1d9e1042b8 | public  | c9210289-4895-481b-946a-b406ba5889b4 2001:db8::/64       |
|                                      |         | 9a044095-ab4d-4767-817e-02d81cbe90ef 172.24.4.0/24       |
| d7faf346-1a26-41a0-bb62-b08808f6ba13 | private | f45ab890-a0d6-48c1-906e-9c8f81659d65 fdfd:f0f5:a83a::/64 |
|                                      |         | 0e85f797-0270-49e9-9600-6f21b9cf47d0 10.254.1.0/24       |
+--------------------------------------+---------+----------------------------------------------------------+
stack@controller:~$
stack@controller:~$ nova boot tdp-test-vm --flavor 1 --availability-zone nova:hostA --block-device id=f91197db-16b5-44b2-beb4-72a9e57041c2,source=image,dest=volume,size=1,bootindex=0 --nic net-id=4ba74a3e-e7a8-4ca4-9de5-8a1d9e1042b8 --min-count 5 --poll
+--------------------------------------+-------------------------------------------------+
| Property                             | Value                                           |
+--------------------------------------+-------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                          |
| OS-EXT-AZ:availability_zone          | nova                                            |
| OS-EXT-SRV-ATTR:host                 | -                                               |
| OS-EXT-SRV-ATTR:hostname             | tdp-test-vm-1                                   |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | -                                               |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000021                               |
| OS-EXT-SRV-ATTR:kernel_id            | 1348de9b-501d-426c-8cb5-e65381208085            |
| OS-EXT-SRV-ATTR:launch_index         | 0                                               |
| OS-EXT-SRV-ATTR:ramdisk_id           | 790ebadb-bc5b-48be-b1f0-95a9214a11ae            |
| OS-EXT-SRV-ATTR:reservation_id       | r-erf2jgt0                                      |
| OS-EXT-SRV-ATTR:root_device_name     | -                                               |
| OS-EXT-SRV-ATTR:user_data            | -                                               |
| OS-EXT-STS:power_state               | 0                                               |
| OS-EXT-STS:task_state                | scheduling                                      |
| OS-EXT-STS:vm_state                  | building                                        |
| OS-SRV-USG:launched_at               | -                                               |
| OS-SRV-USG:terminated_at             | -                                               |
| accessIPv4                           |                                                 |
| accessIPv6                           |                                                 |
| adminPass                            | YvcgM3bNF7TH                                    |
| config_drive                         |                                                 |
| created                              | 2016-05-16T01:55:53Z                            |
| description                          | -                                               |
| flavor                               | m1.tiny (1)                                     |
| hostId                               |                                                 |
| host_status                          |                                                 |
| id                                   | 2a99f5b4-f060-4e3e-8799-f021bca2b056            |
| image                                | Attempt to boot from volume - no image supplied |
| key_name                             | -                                               |
| locked                               | False                                           |
| metadata                             | {}                                              |
| name                                 | tdp-test-vm-1                                   |
| os-extended-volumes:volumes_attached | []                                              |
| progress                             | 0                                               |
| security_groups                      | default                                         |
| status                               | BUILD                                           |
| tenant_id                            | 2794951c7a194b7d8a5047dc69882a14                |
| updated                              | 2016-05-16T01:55:56Z                            |
| user_id                              | aae1397168124897b2065d7bed9da4e2                |
+--------------------------------------+-------------------------------------------------+

Server building... 100% complete
Finished
stack@controller:~$
stack@controller:~$ nova list
+--------------------------------------+---------------+--------+------------+-------------+----------------------------------+
| ID                                   | Name          | Status | Task State | Power State | Networks                         |
+--------------------------------------+---------------+--------+------------+-------------+----------------------------------+
| 2a99f5b4-f060-4e3e-8799-f021bca2b056 | tdp-test-vm-1 | ACTIVE | -          | Running     | public=2001:db8::21, 172.24.4.33 |
| 8ecbe8ad-1e2f-4017-b8ad-db7e2b98e785 | tdp-test-vm-2 | ACTIVE | -          | Running     | public=172.24.4.34, 2001:db8::22 |
| 9b149ed3-7dd6-46ba-9465-11aae7293fed | tdp-test-vm-3 | ACTIVE | -          | Running     | public=2001:db8::23, 172.24.4.35 |
| 5d989262-d8b0-4b26-8121-976856d43524 | tdp-test-vm-4 | ACTIVE | -          | Running     | public=172.24.4.37, 2001:db8::25 |
| eae08f65-ac33-4599-ba61-3868cba847d9 | tdp-test-vm-5 | ACTIVE | -          | Running     | public=172.24.4.36, 2001:db8::24 |
+--------------------------------------+---------------+--------+------------+-------------+----------------------------------+
stack@controller:~$
stack@controller:~$ nova hypervisor-list
+----+---------------------+-------+---------+
| ID | Hypervisor hostname | State | Status  |
+----+---------------------+-------+---------+
| 1  | hostA               | down  | enabled |
| 2  | hostB               | up    | enabled |
+----+---------------------+-------+---------+
stack@controller:~$
stack@controller:~$ nova evacuate 2a99f5b4-f060-4e3e-8799-f021bca2b056 hostB
stack@controller:~$

stack@hostB:~$ grep -A17 "ERROR nova.compute.manager.*Setting instance vm_state to ERROR" ~/logs/n-cpu.log
2016-05-15 22:27:21.006 ERROR nova.compute.manager [req-f56f252f-ed1d-4e47-9401-25cd5bce74aa admin admin] [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056] Setting instance vm_state to ERROR
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056] Traceback (most recent call last):
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/compute/manager.py", line 6434, in _error_out_instance_on_exception
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     yield
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/compute/manager.py", line 2633, in rebuild_instance
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     bdms, recreate, on_shared_storage, preserve_ephemeral)
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/compute/manager.py", line 2677, in _do_rebuild_instance_with_claim
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     self._do_rebuild_instance(*args, **kwargs)
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/compute/manager.py", line 2793, in _do_rebuild_instance
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     self._rebuild_default_impl(**kwargs)
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/compute/manager.py", line 2558, in _rebuild_default_impl
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     block_device_info=new_block_device_info)
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 2569, in spawn
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     block_device_info=block_device_info)
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]   File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 4738, in _create_domain_and_network
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]     raise exception.VirtualInterfaceCreateException()
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056] VirtualInterfaceCreateException: Virtual Interface creation failed
2016-05-15 22:27:21.006 TRACE nova.compute.manager [instance: 2a99f5b4-f060-4e3e-8799-f021bca2b056]
stack@hostB:~$


** Project changed: networking-powervm => nova

** Changed in: nova
     Assignee: Drew Thorstensen (thorst) => (unassigned)

** Project changed: nova => nova-powervm

** Changed in: nova-powervm
     Assignee: (unassigned) => Drew Thorstensen (thorst)

** Also affects: nova
   Importance: Undecided
       Status: New

** Changed in: nova
     Assignee: (unassigned) => Taylor Peoples (tpeoples)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1535918

Title:
  instance.host not updated on evacuation

Status in OpenStack Compute (nova):
  New
Status in nova-powervm:
  Fix Released

Bug description:
  I'm working on the nova-powervm driver for Mitaka and trying to add
  support for evacuation.

  The problem I'm hitting is that instance.host is not updated when the
  compute driver is called to spawn the instance on the destination
  host.  It is still set to the source host.  It's not until after the
  spawn completes that the compute manager updates instance.host to
  reflect the destination host.

  The nova-powervm driver uses the instance events callback mechanism
  during VIF plugging to determine when Neutron has finished provisioning
  the network.  The instance events code sends the event to instance.host
  and hence is sending the event to the source host (which is down).
  This causes the spawn to fail and also causes weirdness when the
  source host gets the events when it's powered back up.

  To temporarily work around the problem, I hacked in setting
  instance.host = CONF.host; instance.save() in the compute driver but
  that's not a good solution.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1535918/+subscriptions

