[Bug 1665487] Re: Live migration tests sometimes timeout waiting for instance to be ACTIVE: "KVM: entry failed, hardware error 0x0"
The fix is here: https://review.openstack.org/#/c/435154/
** Also affects: openstack-gate
Importance: Undecided
Status: New
** Changed in: openstack-gate
Status: New => In Progress
** Changed in: openstack-gate
Assignee: (unassigned) => Matt Riedemann (mriedem)
** Changed in: nova
Status: Triaged => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1665487
Title:
Live migration tests sometimes timeout waiting for instance to be ACTIVE: "KVM: entry failed, hardware error 0x0"
Status in OpenStack Compute (nova):
Invalid
Status in OpenStack-Gate:
In Progress
Bug description:
I've seen this a few times, but started tracking it from here:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/console.html#_2017-02-16_22_25_43_645525
tempest.api.compute.admin.test_live_migration.LiveBlockMigrationTestJSON.test_live_block_migration[id-1dce86b8-eb04-4c03-a9d8-9c1dc3ee0c7b]
--------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "tempest/api/compute/admin/test_live_migration.py", line 122, in test_live_block_migration
        self._test_live_migration()
      File "tempest/api/compute/admin/test_live_migration.py", line 97, in _test_live_migration
        volume_backed=volume_backed)['id']
      File "tempest/api/compute/base.py", line 232, in create_test_server
        **kwargs)
      File "tempest/common/compute.py", line 182, in create_test_server
        server['id'])
      File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
        self.force_reraise()
      File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
        six.reraise(self.type_, self.value, self.tb)
      File "tempest/common/compute.py", line 164, in create_test_server
        clients.servers_client, server['id'], wait_until)
      File "tempest/common/waiters.py", line 96, in wait_for_server_status
        raise lib_exc.TimeoutException(message)
    tempest.lib.exceptions.TimeoutException: Request timed out
    Details: (LiveBlockMigrationTestJSON:test_live_block_migration) Server 0ee93807-d206-4ddf-878c-efd1dd2eab3c failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.
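For context, that TimeoutException comes out of a simple status-polling loop. Below is a minimal sketch of what a waiter along the lines of tempest's wait_for_server_status does, assuming a servers client that exposes show_server(id); the helper body, timeout, and interval here are illustrative, not tempest's actual implementation:

import time

class TimeoutException(Exception):
    pass

def wait_for_server_status(client, server_id, status='ACTIVE',
                           timeout=196, interval=1):
    # Poll the compute API until the server reaches the desired status
    # with no task state pending, or the deadline passes.
    deadline = time.time() + timeout
    while True:
        server = client.show_server(server_id)['server']
        task_state = server.get('OS-EXT-STS:task_state')
        if server['status'] == status and task_state is None:
            return server
        if time.time() > deadline:
            raise TimeoutException(
                'Server %s failed to reach %s within %s s. Current '
                'status: %s. Current task state: %s.'
                % (server_id, status, timeout, server['status'], task_state))
        time.sleep(interval)

In this run the loop kept seeing status BUILD and task state "spawning" until the 196-second deadline passed, which matches the Details line above.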
I was looking in the n-cpu logs on the subnode for 0ee93807-d206-4ddf-878c-efd1dd2eab3c and found that the last thing we see during the server create is here:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_143
2017-02-16 22:22:11.143 14954 DEBUG nova.virt.libvirt.driver [req-025bf4d2-e5ba-4236-a334-f9eb98105ada tempest-LiveBlockMigrationTestJSON-793129345 tempest-LiveBlockMigrationTestJSON-793129345] [instance: 0ee93807-d206-4ddf-878c-efd1dd2eab3c] Instance is running spawn /opt/stack/new/nova/nova/virt/libvirt/driver.py:2689
That's after we've created the domain:
https://github.com/openstack/nova/blob/15.0.0.0rc1/nova/virt/libvirt/driver.py#L2689
After that, the driver waits for the power_state to go to RUNNING.
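Conceptually that wait is just a poll on the domain state, something like the sketch below. dom.state() and the VIR_DOMAIN_* constants are real libvirt-python API, but the helper itself and its timeout are made up for illustration; this is not nova's actual loop:

import time
import libvirt

def wait_for_running(dom, timeout=60, interval=0.5):
    # Poll the libvirt domain until it reports RUNNING. In this failed
    # run the guest flipped to PAUSED right after starting, so a wait
    # like this never completes and the server sits in BUILD/spawning.
    deadline = time.time() + timeout
    while time.time() < deadline:
        state, reason = dom.state()  # e.g. VIR_DOMAIN_RUNNING, VIR_DOMAIN_PAUSED
        if state == libvirt.VIR_DOMAIN_RUNNING:
            return
        time.sleep(interval)
    raise RuntimeError('domain %s never reached RUNNING' % dom.name())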
I see that shortly after that log message we get a libvirt event saying the instance is started:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_214
And then right after that it's paused:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_263
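Those two back-to-back notifications are libvirt lifecycle events, which nova consumes via a callback registered on the libvirt connection. As a standalone sketch, this is roughly how you'd watch them with the libvirt-python bindings (the event API calls are real; the script itself is only illustrative):

import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    # 'event' is a lifecycle event such as VIR_DOMAIN_EVENT_STARTED or
    # VIR_DOMAIN_EVENT_SUSPENDED; nova maps these onto instance power
    # states, which is where the started -> paused flip surfaces.
    print('%s: event=%s detail=%s' % (dom.name(), event, detail))

libvirt.virEventRegisterDefaultImpl()  # must be called before opening the connection
conn = libvirt.openReadOnly('qemu:///system')
conn.domainEventRegisterAny(
    None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)
while True:
    libvirt.virEventRunDefaultImpl()  # dispatch pending events to the callback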
Looking in the QEMU logs for that instance:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz
I see this, which is odd:
KVM: entry failed, hardware error 0x0
In the libvirtd logs I see:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/libvirt/libvirtd.txt.gz#_2017-02-16_22_22_10_879
2017-02-16 22:22:10.879+0000: 8973: debug : qemuProcessLaunch:4876 : QEMU vm=0x7f68a40035a0 name=instance-00000001 running with pid=16410
And later:
2017-02-16 22:22:11.139+0000: 8968: debug : qemuProcessHandleResume:765 : Transitioned guest instance-00000001 out of paused into resumed state
...
2017-02-16 22:22:11.140+0000: 8968: debug : qemuProcessHandleStop:716 : Transitioned guest instance-00000001 to paused state
But I don't see any errors in the libvirt logs.
Right around that time I see this in syslog:
http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/49adcde/logs/subnode-2/syslog.txt.gz#_Feb_16_22_22_11
Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: libvirt version: 1.3.1, package: 1ubuntu10.8 (Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> Mon, 06 Feb 2017 14:30:46 +0100)
Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: hostname: ubuntu-xenial-2-node-ovh-bhs1-7345556-429938
Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: End of file while reading data: Input/output error
I guess the virtlogd I/O error is pretty normal; it shows up all over syslog in our CI runs. Maybe it's a result of the hardware error in the guest?
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1665487/+subscriptions