
yahoo-eng-team team mailing list archive

[Bug 1665487] Re: Live migration tests sometimes timeout waiting for instance to be ACTIVE: "KVM: entry failed, hardware error 0x0"

 

** Changed in: openstack-gate
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1665487

Title:
  Live migration tests sometimes timeout waiting for instance to be
  ACTIVE: "KVM: entry failed, hardware error 0x0"

Status in OpenStack Compute (nova):
  Invalid
Status in OpenStack-Gate:
  Fix Released

Bug description:
  I've seen this a few times, but I'll start tracking it from here:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/console.html#_2017-02-16_22_25_43_645525

  2017-02-16 22:25:43.645525 | 2017-02-16 22:25:43.645 | tempest.api.compute.admin.test_live_migration.LiveBlockMigrationTestJSON.test_live_block_migration[id-1dce86b8-eb04-4c03-a9d8-9c1dc3ee0c7b]
  2017-02-16 22:25:43.646587 | 2017-02-16 22:25:43.646 | -------------------------------------------------------------------------------------------------------------------------------------------
  2017-02-16 22:25:43.647903 | 2017-02-16 22:25:43.647 | 
  2017-02-16 22:25:43.649473 | 2017-02-16 22:25:43.649 | Captured traceback:
  2017-02-16 22:25:43.651373 | 2017-02-16 22:25:43.650 | ~~~~~~~~~~~~~~~~~~~
  2017-02-16 22:25:43.656802 | 2017-02-16 22:25:43.656 |     Traceback (most recent call last):
  2017-02-16 22:25:43.659061 | 2017-02-16 22:25:43.658 |       File "tempest/api/compute/admin/test_live_migration.py", line 122, in test_live_block_migration
  2017-02-16 22:25:43.661006 | 2017-02-16 22:25:43.660 |         self._test_live_migration()
  2017-02-16 22:25:43.671547 | 2017-02-16 22:25:43.671 |       File "tempest/api/compute/admin/test_live_migration.py", line 97, in _test_live_migration
  2017-02-16 22:25:43.672889 | 2017-02-16 22:25:43.672 |         volume_backed=volume_backed)['id']
  2017-02-16 22:25:43.674405 | 2017-02-16 22:25:43.673 |       File "tempest/api/compute/base.py", line 232, in create_test_server
  2017-02-16 22:25:43.676559 | 2017-02-16 22:25:43.676 |         **kwargs)
  2017-02-16 22:25:43.681637 | 2017-02-16 22:25:43.680 |       File "tempest/common/compute.py", line 182, in create_test_server
  2017-02-16 22:25:43.683300 | 2017-02-16 22:25:43.682 |         server['id'])
  2017-02-16 22:25:43.684766 | 2017-02-16 22:25:43.684 |       File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2017-02-16 22:25:43.686210 | 2017-02-16 22:25:43.685 |         self.force_reraise()
  2017-02-16 22:25:43.687896 | 2017-02-16 22:25:43.687 |       File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2017-02-16 22:25:43.689493 | 2017-02-16 22:25:43.689 |         six.reraise(self.type_, self.value, self.tb)
  2017-02-16 22:25:43.691459 | 2017-02-16 22:25:43.691 |       File "tempest/common/compute.py", line 164, in create_test_server
  2017-02-16 22:25:43.693741 | 2017-02-16 22:25:43.692 |         clients.servers_client, server['id'], wait_until)
  2017-02-16 22:25:43.695699 | 2017-02-16 22:25:43.695 |       File "tempest/common/waiters.py", line 96, in wait_for_server_status
  2017-02-16 22:25:43.697692 | 2017-02-16 22:25:43.697 |         raise lib_exc.TimeoutException(message)
  2017-02-16 22:25:43.699378 | 2017-02-16 22:25:43.698 |     tempest.lib.exceptions.TimeoutException: Request timed out
  2017-02-16 22:25:43.701086 | 2017-02-16 22:25:43.700 |     Details: (LiveBlockMigrationTestJSON:test_live_block_migration) Server 0ee93807-d206-4ddf-878c-efd1dd2eab3c failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: BUILD. Current task state: spawning.
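
  For context, the wait_for_server_status at the bottom of that traceback
  is essentially a polling loop like this (a simplified sketch, not the
  exact tempest code; the 196 s presumably comes from tempest's
  build_timeout option):

    import time

    def wait_for_server_status(client, server_id, status='ACTIVE',
                               timeout=196, interval=1):
        # Poll the compute API until the server reaches the desired
        # status with no task state, or until the timeout expires.
        start = time.time()
        while time.time() - start < timeout:
            body = client.show_server(server_id)['server']
            if (body['status'] == status and
                    body.get('OS-EXT-STS:task_state') is None):
                return
            time.sleep(interval)
        raise TimeoutError('server %s failed to reach %s status'
                           % (server_id, status))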

  I was looking in the n-cpu logs on the subnode for
  0ee93807-d206-4ddf-878c-efd1dd2eab3c, and the last thing we see during
  the server create is here:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_143

  2017-02-16 22:22:11.143 14954 DEBUG nova.virt.libvirt.driver [req-
  025bf4d2-e5ba-4236-a334-f9eb98105ada tempest-
  LiveBlockMigrationTestJSON-793129345 tempest-
  LiveBlockMigrationTestJSON-793129345] [instance: 0ee93807-d206-4ddf-
  878c-efd1dd2eab3c] Instance is running spawn
  /opt/stack/new/nova/nova/virt/libvirt/driver.py:2689

  That's after we've created the domain:

  https://github.com/openstack/nova/blob/15.0.0.0rc1/nova/virt/libvirt/driver.py#L2689

  After that the driver is waiting for the power_state to go to RUNNING.
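
  That wait is a fixed-interval poll of the guest's power state, roughly
  like this (a rough sketch with illustrative names; in nova it's the
  driver's _wait_for_boot inner function driven by a
  FixedIntervalLoopingCall):

    import time

    RUNNING = 1  # same value as nova.compute.power_state.RUNNING

    def wait_for_running(get_power_state, instance_uuid,
                         interval=0.5, timeout=None):
        # Spin until the guest reports RUNNING; if the domain ends up
        # paused instead, as it does here, this never completes.
        start = time.time()
        while get_power_state(instance_uuid) != RUNNING:
            if timeout is not None and time.time() - start > timeout:
                raise TimeoutError('instance %s never reached RUNNING'
                                   % instance_uuid)
            time.sleep(interval)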

  I see that shortly after that log message we get a libvirt event
  saying the instance is started:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_214

  And then right after that it's paused:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/screen-n-cpu.txt.gz#_2017-02-16_22_22_11_263
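
  Those n-cpu entries come from libvirt lifecycle events. The mapping is
  roughly this (a sketch using the libvirt-python event API; nova
  registers a similar callback in its Host class rather than this exact
  code):

    import libvirt

    def lifecycle_cb(conn, dom, event, detail, opaque):
        # VIR_DOMAIN_EVENT_STARTED shows up in n-cpu as the instance
        # being started; VIR_DOMAIN_EVENT_SUSPENDED as it being paused.
        if event == libvirt.VIR_DOMAIN_EVENT_STARTED:
            print('%s started' % dom.name())
        elif event == libvirt.VIR_DOMAIN_EVENT_SUSPENDED:
            print('%s paused' % dom.name())

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.openReadOnly('qemu:///system')
    conn.domainEventRegisterAny(None,
                                libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                lifecycle_cb, None)
    while True:
        libvirt.virEventRunDefaultImpl()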

  Looking in the QEMU logs for that instance:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz

  I see this, which is odd:

  KVM: entry failed, hardware error 0x0
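
  That error usually means KVM itself failed on VM entry, which in a CI
  guest tends to implicate nested virtualization on the cloud host rather
  than anything nova did. A quick sanity check one could run on the
  subnode (a diagnostic sketch, not taken from this job's logs):

    import os

    def kvm_diagnostics():
        # KVM acceleration needs /dev/kvm; gate nodes are themselves
        # cloud VMs, so this depends on nested virt being exposed.
        print('/dev/kvm present: %s' % os.path.exists('/dev/kvm'))
        # Nested-virt toggle: 'Y'/'N' for kvm_intel, '1'/'0' for kvm_amd.
        for mod in ('kvm_intel', 'kvm_amd'):
            path = '/sys/module/%s/parameters/nested' % mod
            if os.path.exists(path):
                with open(path) as f:
                    print('%s nested: %s' % (mod, f.read().strip()))

    kvm_diagnostics()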

  In the libvirtd logs I see:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/libvirt/libvirtd.txt.gz#_2017-02-16_22_22_10_879

  2017-02-16 22:22:10.879+0000: 8973: debug : qemuProcessLaunch:4876 :
  QEMU vm=0x7f68a40035a0 name=instance-00000001 running with pid=16410

  And later:

  2017-02-16 22:22:11.139+0000: 8968: debug : qemuProcessHandleResume:765 : Transitioned guest instance-00000001 out of paused into resumed state
  ...
  2017-02-16 22:22:11.140+0000: 8968: debug : qemuProcessHandleStop:716 : Transitioned guest instance-00000001 to paused state

  But I don't see any errors in the libvirt logs.

  Right around that time I see this in syslog:

  http://logs.openstack.org/50/435050/1/check/gate-tempest-dsvm-
  multinode-live-migration-ubuntu-
  xenial/49adcde/logs/subnode-2/syslog.txt.gz#_Feb_16_22_22_11

  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: libvirt version: 1.3.1, package: 1ubuntu10.8 (Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> Mon, 06 Feb 2017 14:30:46 +0100)
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: hostname: ubuntu-xenial-2-node-ovh-bhs1-7345556-429938
  Feb 16 22:22:11 ubuntu-xenial-2-node-ovh-bhs1-7345556-429938 virtlogd[8995]: End of file while reading data: Input/output error

  I guess the virtlogd I/O error is pretty normal; it shows up all over
  the syslogs in our CI runs. Maybe it's a result of the hardware error
  in the guest?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1665487/+subscriptions

