yahoo-eng-team team mailing list archive

[Bug 1799152] [NEW] Retry after hitting libvirt error code VIR_ERR_OPERATION_INVALID in live migration.

 

Public bug reported:

Description
===========
When migration of a persistent guest completes, the guest merely shuts off,
but libvirt unhelpfully raises a VIR_ERR_OPERATION_INVALID error code. In the
nova code, we treat this case as success. However, if the qemu-kvm process is
killed accidentally in the middle of a live migration, for example by the host
OOM killer (rare in our environment, but it has happened a few times), the
domain state becomes SHUTOFF and we also get VIR_ERR_OPERATION_INVALID when
calling `self._domain.jobStats()`. In that situation the migration should be
considered failed; otherwise the post_live_migration() function starts to
clean up instance files and we lose customers' data forever.
IMHO, after hitting VIR_ERR_OPERATION_INVALID we may need to `pretend` the
migration job is still running and retry getting the job stats a few times,
with a configurable retry count. If the migration eventually succeeds, we
won't get VIR_ERR_OPERATION_INVALID after some retries, whereas the error
code keeps coming back if the qemu-kvm process was killed accidentally.
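
A minimal sketch of the retry idea, not actual nova code. It assumes a
hypothetical helper `get_job_stats_with_retry()` and a stand-in exception
class `FakeLibvirtError` so that it runs without libvirt installed; in real
code the exception would be `libvirt.libvirtError` and the constant would come
from the libvirt module. The parameters `max_retries` and `interval` are
illustrative names for the proposed configurable retry count and delay.

```python
import time

# Stand-in for libvirt.VIR_ERR_OPERATION_INVALID so the sketch runs standalone.
VIR_ERR_OPERATION_INVALID = 55


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, code):
        super().__init__("libvirt error %d" % code)
        self._code = code

    def get_error_code(self):
        return self._code


def get_job_stats_with_retry(domain, max_retries=3, interval=0.0):
    """Return job stats, or None if VIR_ERR_OPERATION_INVALID persists.

    While retries remain, the caller would keep treating the job as
    running. A None result means the error never went away, which the
    proposal above would interpret as a failed migration.
    """
    for attempt in range(max_retries + 1):
        try:
            return domain.jobStats()
        except FakeLibvirtError as ex:
            if ex.get_error_code() != VIR_ERR_OPERATION_INVALID:
                raise  # unrelated errors are not retried
            if attempt < max_retries:
                time.sleep(interval)
    return None


class FlakyDomain:
    """Test double: fails a fixed number of times, then returns stats."""

    def __init__(self, failures, code=VIR_ERR_OPERATION_INVALID):
        self.failures = failures
        self.code = code
        self.calls = 0

    def jobStats(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise FakeLibvirtError(self.code)
        return {"type": "completed"}
```

With this shape, a domain whose error clears after a couple of attempts yields
job stats, while a domain that keeps raising VIR_ERR_OPERATION_INVALID yields
None after `max_retries + 1` attempts, so the monitor could mark the migration
as failed instead of starting post_live_migration().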

Steps to reproduce
==================
* Run `nova live-migration <uuid>` on the controller node.
* Once the live migration monitor on the source compute node starts to get JobInfo, kill the qemu-kvm process on the source host.
* Check whether post_live_migration starts to execute on the source host.
* Check whether post_live_migration starts to execute on the destination host.
* Check the image files on both the source host and the destination host.

Expected result
===============

Migration should be considered failed.

Actual result
=============

Post live migration starts to execute on the source host and cleans up the
instance files. The instance disappears from both the source and the
destination host.

Environment
===========
1. My environment is deployed with packstack, and the OpenStack nova release is Queens.

2. Libvirt + KVM

Logs & Configs
==============

Some logs after the qemu-kvm process is killed:
```
...
2018-09-21 14:08:34.180 11099 DEBUG nova.virt.libvirt.migration [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Downtime does not need to change update_downtime /usr/lib/python2.7/site-packages/nova/virt/libvirt/migration.py:410
2018-09-21 14:08:34.305 11099 DEBUG nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:7394
2018-09-21 14:08:34.886 11099 DEBUG nova.virt.libvirt.guest [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] Domain has shutdown/gone away: Requested operation is not valid: domain is not running get_job_info /usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:720
2018-09-21 14:08:34.887 11099 INFO nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration operation has completed
2018-09-21 14:08:34.887 11099 INFO nova.compute.manager [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] _post_live_migration() is started..
...
```

** Affects: nova
     Importance: Undecided
     Assignee: Fan Zhang (fanzhang)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => Fan Zhang (fanzhang)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1799152

Title:
  Retry after hitting libvirt error code VIR_ERR_OPERATION_INVALID in
  live migration.

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1799152/+subscriptions