
yahoo-eng-team team mailing list archive

[Bug 1745073] [NEW] Nova Compute unintentionally stopped to monitor live migration

 

Public bug reported:

Description
===========

There is a case where nova-compute unintentionally stops monitoring a live migration even though the live migration operation thread (_live_migration_operation) is still running.
As a result, nova-compute ends up reporting to the Nova conductor that the migration succeeded, and a nova-compute periodic task then tries to delete all instance-related data under /var/lib/nova/instances/<instance-id>, because from Nova's point of view the live migration succeeded.
This can break the live migration itself, and it also misleads the operator about the actual status of the live migration operation.

"So it must be better at least Nova compute monitor live migration
during _live_migration_operation thread be running"

This case does not normally happen as long as libvirtd correctly maintains the domain job information and cleans up correctly after the job completes, even if nova-compute never checks whether the live migration operation thread has finished.
However, if libvirtd fails to maintain the domain job information correctly, or something goes wrong in the cleanup phase, nova-compute can treat the live migration as successful even though it is obviously still in progress, because the _live_migration_operation thread is still running.

One could argue that this is purely a libvirtd problem and that Nova does not need to care about it.
However, I think it is better to take a safer approach if we can. I actually hit this situation with libvirtd 3.2.0, and it took some time to realize, from the logs and from the migration status in the database, that the live migration operation thread had never finished.

More specifically, I think finish_event should always be checked here, not only when the job type is VIR_DOMAIN_JOB_NONE:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6871
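
As an illustration, here is a minimal, hypothetical sketch of the relationship between the two threads, not the actual Nova code. The function names, the get_job_info callable, the poll_interval parameter, and the simplified success/failure decision are all placeholders; finish_event is assumed to be a threading.Event that the operation thread sets when virDomainMigrateToURI3 returns.

```
# Hypothetical sketch only -- not the actual nova.virt.libvirt.driver code.
import time

# libvirt job type constants (values from the libvirt API)
VIR_DOMAIN_JOB_NONE = 0
VIR_DOMAIN_JOB_COMPLETED = 3


def live_migration_operation(domain, dest_uri, finish_event):
    """Runs in its own thread and performs the actual migration call."""
    try:
        # virDomainMigrateToURI3 via the libvirt-python binding.
        domain.migrateToURI3(dest_uri, {}, 0)
    finally:
        # finish_event (a threading.Event) is only set once the migration
        # call returns. If libvirtd blocks forever here, it is never set.
        finish_event.set()


def live_migration_monitor(get_job_info, finish_event, poll_interval=0.5):
    """Monitoring loop; keeps running until the operation thread is done."""
    while True:
        job = get_job_info()  # e.g. a wrapper around the domain job info

        # Proposed change: consult finish_event on every iteration, not
        # only when the job type is VIR_DOMAIN_JOB_NONE. As long as the
        # operation thread has not signalled completion, keep monitoring
        # even if libvirtd already reports that no job is running.
        if not finish_event.is_set():
            time.sleep(poll_interval)
            continue

        # The operation thread has returned, so the reported job info can
        # now be used to decide between success and failure (simplified).
        if job.type in (VIR_DOMAIN_JOB_NONE, VIR_DOMAIN_JOB_COMPLETED):
            return 'completed'
        return 'failed'
```

With this structure, an incorrect or premature job status from libvirtd can no longer end the monitoring on its own; monitoring only stops after virDomainMigrateToURI3 has actually returned.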

The libvirtd-side problem is already fixed by
https://www.redhat.com/archives/libvir-list/2017-April/msg00387.html and
the fix is included in libvirt 3.3.0, but I still think nova-compute
should change its behaviour to guard against similar problems in the
future.

Steps to reproduce
==================

* *Use libvirtd 3.2.0, which has a bug related to live migration*
   -> This version of libvirtd often (but not always) blocks forever in the virDomainMigrateToURI3 method, causing the _live_migration_operation thread to run forever

* Create a test VM with a swap disk

   ```
   $ openstack flavor create --ram 1024 --disk 20 --swap 4048 --vcpus 1 test
   +----------------------------+--------------------------------------+
   | Field                      | Value                                |
   +----------------------------+--------------------------------------+
   | disk                       | 20                                   |
   | id                         | d4e400a7-fd10-4c18-9dbc-f89f24e668af |
   | name                       | test                                 |
   | os-flavor-access:is_public | True                                 |
   | ram                        | 1024                                 |
   | rxtx_factor                | 1.0                                  |
   | swap                       | 4048                                 |
   | vcpus                      | 1                                    |
   +----------------------------+--------------------------------------+
   ```
   
   ```
   $ openstack server create --flavor test --image <image> --nic net-id=<network> test_server
   ```


* Nova "block" live migration test vm from HV1 to HV2

   ```
   $ nova live-migration --block-migrate test_server HV2
   ```

* Check migration status

   ```
   $ nova migration-list
   +-----+-----------------------+-----------------------+----------------+--------------+-------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   | Id  | Source Node           | Dest Node             | Source Compute | Dest Compute | Dest Host   | Status    | Instance UUID                        | Old Flavor | New Flavor | Created At                 | Updated At                 | Type           |
   +-----+-----------------------+-----------------------+----------------+--------------+-------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   | 1   | -                     | -                     | HV1            | HV2          | -           | completed | e484eb18-2794-4651-a357-d2070940ed32 | 6          | 6          | 2018-01-09T03:02:10.000000 | 2018-01-09T03:02:20.000000 | live-migration |
   +-----+-----------------------+-----------------------+----------------+--------------+-------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   ```

* Check VM status

   ```
   $ nova list
   +--------------------------------------+--------------------------+--------+------------+-------------+--------------------+
   | ID                                   | Name                     | Status | Task State | Power State | Networks           |
   +--------------------------------------+--------------------------+--------+------------+-------------+--------------------+
   | a221c19b-4d4e-46d4-8888-10c14ca0fe27 | test_server              | ACTIVE | -          | Paused      | net1=192.168.11.11 |
   +--------------------------------------+--------------------------+--------+------------+-------------+--------------------+
   ```


Expected result
===============

On the host running nova-api/nova-conductor:
 * the migration status should not be changed to "completed" until the _live_migration_operation thread finishes
 * the VM status should not be changed to "ACTIVE" until the _live_migration_operation thread finishes

On the host running nova-compute:
 * live migration should continue to be monitored while the _live_migration_operation thread is running


Actual result
=============

On the host running nova-api:
* the migration status was changed to "completed" in the Nova database (checked with nova migration-list)
* the VM status was changed to "ACTIVE" (checked with nova list)

On the host running nova-compute:
* live migration monitoring stopped even though the _live_migration_operation thread was still running (checked via the log message "Live migration monitoring is all done")


Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/
   * 13.1.0-1.el7 (CentOS 7)

2. Which hypervisor did you use?
   * libvirt + KVM 
       * libvirt-daemon: 3.2.0-14.el7_4.7 
       * qemu-kvm: 2.6.0-28.el7.10.1

3. Which storage type did you use?
   * local storage (just ephemeral disk)

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1745073

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1745073/+subscriptions

