[Bug 2051685] [NEW] After repeat of incomplete migration nova applies wrong (status=error) migration context in update_available_resource periodic job
Public bug reported:
The original problem, observed in a downstream deployment, was
overcommit on dedicated PCPUs and a CPUPinningInvalid exception breaking
the update_available_resource periodic job.
The following is not an end-to-end reproduction, but I hope it
demonstrates where things go wrong.
The environment is a multi-node devstack:
devstack0 - all-in-one
devstack0a - compute
Nova is backed by libvirt/qemu/kvm.
devstack 6b0f055b
nova on devstack0 39f560d673
nova on devstack0a a72f7eaac7
libvirt 8.0.0-1ubuntu7.8
qemu 1:6.2+dfsg-2ubuntu6.16
linux 5.15.0-91-generic
# Clean up if not the first run.
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack volume list --status available -f value -c ID | xargs -r openstack volume delete
# Create a server on devstack0.
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_policy=dedicated
openstack server create --flavor cirros256-pinned --image cirros-0.6.2-x86_64-disk --boot-from-volume 1 --nic net-id=private --availability-zone :devstack0 vm0 --wait
# Start a live migration to devstack0a, but simulate a failure. In my environment a complete live migration takes around 20 seconds; with 'sleep 3' it usually breaks in the 'preparing' status.
# As far as I understand, other kinds of migration (like cold migration) are also affected.
openstack server migrate --live-migration vm0 --wait & sleep 2 ; ssh devstack0a sudo systemctl stop devstack@n-cpu
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | preparing | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:41:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# After some timeout (around 60 s) the migration goes to 'error' status.
$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# Wait before restarting n-cpu on devstack0a. I don't fully understand what determines whether the migration finally ends up in 'failed' or in 'error' status. Currently it seems that if I restart n-cpu too quickly, the migration goes to 'failed' right after the restart. But for the reproduction of this bug it has to stay in 'error'. So far, restarting only after 10+ minutes has always left the migration in 'error' status.
sleep 600 ; ssh devstack0a sudo systemctl start devstack@n-cpu
# The server gets stuck in MIGRATING status.
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
MIGRATING
# We reset it to ACTIVE. While I don't currently have an at-will upstream reproduction without 'set --state', I believe the same end problem was observed in downstream environments without ever resorting to 'set --state'.
$ openstack server set --state active vm0
$ openstack server show vm0 -f value -c status -c compute_host
devstack0
ACTIVE
# Start following n-cpu logs on devstack0a.
sudo journalctl -u devstack@n-cpu.service -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
# Try to live migrate the same server again, this time letting it finish. It completes successfully. (I have occasionally seen the 2nd migration fail with "nova.exception.DestinationDiskExists: The supplied disk path (/opt/stack/data/nova/instances/19700c16-88d9-416c-9fb8-b79bea7925ee) already exists, it is expected not to exist." My best guess is that I did not give n-cpu on devstack0a enough time to clean up after its restart. Most of the time this error does not happen.)
openstack server migrate --live-migration vm0 --wait
# However, in the n-cpu logs on devstack0a we can see that the update_available_resource periodic job applies the migration context of the previously failed migration and keeps applying it every minute.
$ sudo journalctl -u devstack@n-cpu.service -f | egrep --line-buffered 'ERROR|Applying migration context for instance'
jan 29 12:57:37 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:58:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
jan 29 12:59:38 devstack0a nova-compute[34197]: DEBUG nova.compute.resource_tracker [None req-ec6eeebe-f367-4246-b440-9c34915395cc None None] Applying migration context for instance a2b43180-8ad9-4c12-ad47-12b8dd7a7384 as it has an incoming, in-progress migration c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9. Migration status is error {{(pid=34197) _update_available_resource /opt/stack/nova/nova/compute/resource_tracker.py:1016}}
...
# For reference, these are the migrations at the end.
rubasov@devstack0:~/src/os/openstack/devstack$ openstack server migration list --server vm0 --sort-column 'Created At'
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| Id | UUID                                 | Source Node | Dest Node  | Source Compute | Dest Compute | Dest Host      | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 33 | c7a42f9e-dfee-4a2c-b42a-a73b1a19c0c9 | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | error     | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:41:40.000000 | 2024-01-29T12:42:42.000000 |
| 34 | 20d7f349-2d5a-4796-bedc-8bc6e06134cf | devstack0   | devstack0a | devstack0      | devstack0a   | 192.168.122.79 | completed | a2b43180-8ad9-4c12-ad47-12b8dd7a7384 | 11         | 11         | live-migration | 2024-01-29T12:56:57.000000 | 2024-01-29T12:57:10.000000 |
+----+--------------------------------------+-------------+------------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
# The relevant code.
https://opendev.org/openstack/nova/src/commit/6531ed6310c4c11ee807e10d1a0fa91658c3afea/nova/compute/resource_tracker.py#L995-L1024
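In rough terms, that block does the following (a simplified paraphrase based on my reading, not the verbatim nova code; see the link above for the real thing):

    # Paraphrased sketch of the loop in
    # resource_tracker._update_available_resource(). The tracker fetches
    # the migrations targeting this host (deliberately including ones in
    # 'error' status) and pairs them with their instances.
    for migration in migrations_for_this_host:
        instance = instances_by_uuid.get(migration.instance_uuid)
        if instance is None:
            continue
        # For an incoming, in-progress migration the instance's migration
        # context is applied, so the resources it would claim on this
        # host (e.g. pinned CPUs) are accounted for here.
        LOG.debug('Applying migration context for instance %s as it has '
                  'an incoming, in-progress migration %s. Migration '
                  'status is %s',
                  instance.uuid, migration.uuid, migration.status)
        instance.apply_migration_context()

A migration stuck in status=error keeps being selected by this query, so its stale migration context keeps being applied on the destination.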
I don't yet see how this should be fixed. Maybe a migration should
never be stuck indefinitely in status=error; to me, status=failed seems
to be the proper final state. Maybe we should not apply a status=error
migration context when there is a successfully repeated migration of the
same server to the same destination host. Or at least we should not apply
the status=error migration context in every update_available_resource
call (a rough sketch of what I mean follows below). Maybe we should also
do some cleanup when resetting the server state; however, I believe the
problem can appear without ever resetting the state, so that would not
be a generic fix. Let me know what you think...
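To make the 'at least' option concrete, here is a minimal, untested sketch of the kind of guard I mean, reusing the hypothetical names from the paraphrase above (whether skipping error migrations here is safe for the resource accounting of all migration types would need to be verified):

    # Hypothetical guard, for illustration only: do not re-apply the
    # migration context of a migration that already ended in 'error'.
    if migration.status == 'error':
        LOG.debug('Not applying migration context for instance %s: '
                  'migration %s is in error status',
                  instance.uuid, migration.uuid)
    else:
        instance.apply_migration_context()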
Also please let me know whether this is sufficient to demonstrate that
there is a bug here. If needed, I can work on extending the reproduction
to recreate the actual overcommit problems, but I hope applying the
wrong migration context already shows we have a problem here, sparing me
that work.
** Affects: nova
Importance: Undecided
Status: New
** Tags: resource-tracker
https://bugs.launchpad.net/bugs/2051685