yahoo-eng-team team mailing list archive

[Bug 2065403] [NEW] Online data migrations fail to execute correctly when upgrading from 2023.1

 

Public bug reported:

Description
===========
When running online_data_migrations during an upgrade from 2023.1 to 2023.2 or later, the 'populate_instance_compute_id' migration leaves many eligible records unmigrated on a moderately sized deployment.

The issue is that by default nova-manage uses a limit (max-count) of 50
records per pass of the migration. Numerous records are unsuitable for
migration (those with no existing node ID), so they are never updated
and keep matching the query. Eventually the first 50 records returned
by the database query are all unsuitable, nothing is migrated in that
pass, and the migration exits as if it had completed, even though many
eligible records remain to be migrated.
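
The failure mode described above can be reproduced with a small
in-memory simulation. This is only a sketch, not nova's code: the
record layout, the helper names and the stop condition (stop once a
pass migrates nothing) are assumptions made for illustration.

```python
def make_records():
    """Build a hypothetical instances table as a list of dicts, in query order."""
    # 44 rows without a node, 100 migratable rows, 6 more rows without a
    # node, then another 100 migratable rows.
    layout = [(44, None), (100, "node-a"), (6, None), (100, "node-b")]
    records = []
    for count, node in layout:
        records.extend({"node": node, "compute_id": None} for _ in range(count))
    return records


def migrate_batch(records, max_count):
    """One pass: take the first max_count unmigrated rows, migrate those with a node."""
    batch = [r for r in records if r["compute_id"] is None][:max_count]
    migrated = 0
    for record in batch:
        if record["node"] is None:
            continue  # not migratable, so it matches the query again next pass
        record["compute_id"] = 1  # placeholder compute node ID
        migrated += 1
    return len(batch), migrated


def run_migration(records, max_count):
    """Repeat passes until one migrates nothing (assumed stop condition)."""
    history = []
    while True:
        matched, migrated = migrate_batch(records, max_count)
        history.append((matched, migrated))
        if migrated == 0:
            return history


def count_left_behind(records):
    """Count rows that have a node but were never given a compute_id."""
    return sum(1 for r in records
               if r["node"] is not None and r["compute_id"] is None)


records = make_records()
history = run_migration(records, max_count=50)
left_behind = count_left_behind(records)
print(history[0], history[-1], left_behind)  # (50, 6) (50, 0) 100
```

In this layout the run stops after 18 passes with 100 migratable rows
untouched, because the 50 rows without a node fill the whole batch.
Re-running the same simulation with max_count=100 migrates everything,
which matches the effect of raising --max-count.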

A secondary issue is that this query approach causes the migration
method to be executed many hundreds or thousands of times more than
necessary as on every iteration it has to ignore the same irrelevant
records. This takes a long time. In the deployment I've just upgraded
the migration took upwards of 15 minutes before exiting.

My suspicion is that the query in
https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1359
should really be filtering out records without a 'node' (or 'host')
entry to avoid the need for internal exception handling, but there may
be some other reason this wasn't done initially.
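
A minimal sketch of the suggested filtering, using an in-memory SQLite
table rather than nova's real models and query code. The table layout,
the function name and the placeholder compute_id value are assumptions
for illustration; the WHERE clause mirrors the count query used later
in this report.

```python
import sqlite3


def populate_compute_id(conn, max_count):
    """Migrate up to max_count rows, selecting only rows that have a node and host."""
    rows = conn.execute(
        "SELECT id FROM instances "
        "WHERE compute_id IS NULL AND node IS NOT NULL AND host IS NOT NULL "
        "ORDER BY id LIMIT ?",
        (max_count,),
    ).fetchall()
    for (row_id,) in rows:
        # Placeholder value; the real migration looks up the compute node.
        conn.execute("UPDATE instances SET compute_id = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "id INTEGER PRIMARY KEY, host TEXT, node TEXT, compute_id INTEGER)"
)
# Hypothetical data: 60 rows without a node first, then 200 migratable rows.
data = [("host-a", None)] * 60 + [("host-a", "node-a")] * 200
conn.executemany("INSERT INTO instances (host, node) VALUES (?, ?)", data)

passes = 0
while populate_compute_id(conn, 50):
    passes += 1

remaining = conn.execute(
    "SELECT count(*) FROM instances "
    "WHERE compute_id IS NULL AND node IS NOT NULL AND host IS NOT NULL"
).fetchone()[0]
skipped = conn.execute(
    "SELECT count(*) FROM instances WHERE compute_id IS NULL AND node IS NULL"
).fetchone()[0]
print(passes, remaining, skipped)  # 4 0 60
```

With the filter in the query, rows without a node never occupy a slot
in the batch, so every pass makes progress and the loop ends only when
no eligible rows are left.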

Steps to reproduce
==================
Using a moderately sized database (a few tens of thousands of records):
* Perform an upgrade from 2023.1 to 2023.2
* Execute 'nova-manage db online_data_migrations'

Expected result
===============
Migrations complete in a reasonable time, with all relevant records migrated.

Actual result
=============
nova-manage exits after a long time, reporting apparent success, but in reality many records remain unmigrated.

In the two deployments we have migrated to date, we are left with the
following apparently relevant records which should have been migrated
but haven't been (one query result per deployment):

MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
|    29147 |
+----------+
1 row in set (0.045 sec)

MariaDB [nova]> select count(*) from instances where compute_id is null and node is not null and host is not null;
+----------+
| count(*) |
+----------+
|    22622 |
+----------+
1 row in set (0.048 sec)

Environment
===========
Nova 45a926156c863b468318cce462a21027685d07a6 (upgraded from 2023.1)
Libvirt+KVM
Ceph
Neutron+LXB

Logs & Configs
==============
During nova-manage execution, log messages such as the following are printed:

50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
50 rows matched query populate_instance_compute_id, 6 migrated
...
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 1 migrated
50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
|              Migration              | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
|     fill_virtual_interface_list     |      0       |     0     |
|         migrate_empty_ratio         |      0       |     0     |
|   migrate_quota_classes_to_api_db   |      0       |     0     |
|    migrate_quota_limits_to_api_db   |      0       |     0     |
|      migration_migrate_to_uuid      |      0       |     0     |
|          populate_dev_uuids         |      0       |     0     |
|     populate_instance_compute_id    |    54300     |    1667   |
| populate_missing_availability_zones |      0       |     0     |
|      populate_queued_for_delete     |      0       |     0     |
|           populate_user_id          |      0       |     0     |
|            populate_uuids           |      0       |     0     |
+-------------------------------------+--------------+-----------+

Note the repeated migration counts (6, then 1), which indicate that
first 44 and then 49 of the 50 records in each pass are irrelevant for
migration (triggering
https://opendev.org/openstack/nova/blame/commit/7096423b343ffce9622fd078fc2b3a87fd3386f7/nova/objects/instance.py#L1369).
Also note that the Total Needed figure is erroneous, as it counts these
same irrelevant records again each time the method is called.
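
As a rough check of the figures in the summary table, the arithmetic
can be worked through as below. It assumes that every pass matched
exactly 50 rows, as in the log lines shown (the elided lines are not
visible), and that Total Needed is the sum of matched rows over all
passes.

```python
batch_size = 50       # default max-count per pass
total_needed = 54300  # "Total Needed" from the summary table
completed = 1667      # "Completed" from the summary table

# Number of passes implied by the total, if each pass matched a full batch.
passes = total_needed // batch_size
# Matches that did not result in a migration (mostly repeat counts).
not_migrated_matches = total_needed - completed

print(passes, not_migrated_matches)  # 1086 52633
```

Under those assumptions the method ran 1086 times to migrate 1667
records, and 52633 of the counted matches did not lead to a migration.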

A further run of the migration after the above completion shows:

50 rows matched query populate_instance_compute_id, 0 migrated
+-------------------------------------+--------------+-----------+
|              Migration              | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
|     fill_virtual_interface_list     |      0       |     0     |
|         migrate_empty_ratio         |      0       |     0     |
|   migrate_quota_classes_to_api_db   |      0       |     0     |
|    migrate_quota_limits_to_api_db   |      0       |     0     |
|      migration_migrate_to_uuid      |      0       |     0     |
|          populate_dev_uuids         |      0       |     0     |
|     populate_instance_compute_id    |      50      |     0     |
| populate_missing_availability_zones |      0       |     0     |
|      populate_queued_for_delete     |      0       |     0     |
|           populate_user_id          |      0       |     0     |
|            populate_uuids           |      0       |     0     |
+-------------------------------------+--------------+-----------+

If you then increase the --max-count parameter, further migrations will
proceed:

100 rows matched query populate_instance_compute_id, 44 migrated
+-------------------------------------+--------------+-----------+
|              Migration              | Total Needed | Completed |
+-------------------------------------+--------------+-----------+
|     fill_virtual_interface_list     |      0       |     0     |
|         migrate_empty_ratio         |      0       |     0     |
|   migrate_quota_classes_to_api_db   |      0       |     0     |
|    migrate_quota_limits_to_api_db   |      0       |     0     |
|      migration_migrate_to_uuid      |      0       |     0     |
|          populate_dev_uuids         |      0       |     0     |
|     populate_instance_compute_id    |     100      |     44    |
| populate_missing_availability_zones |      0       |     0     |
|      populate_queued_for_delete     |      0       |     0     |
|           populate_user_id          |      0       |     0     |
|            populate_uuids           |      0       |     0     |
+-------------------------------------+--------------+-----------+

In our two deployment databases we have the following records which
would likely trigger this issue (one query result per deployment):

MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
|      251 |
+----------+
1 row in set (0.029 sec)

MariaDB [nova]> select count(*) from instances where node is null;
+----------+
| count(*) |
+----------+
|      141 |
+----------+
1 row in set (0.039 sec)

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2065403

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2065403/+subscriptions