yahoo-eng-team team mailing list archive
Message #87715
[Bug 1821755] Re: [SRU] live migration break the anti-affinity policy of server group simultaneously
** Changed in: cloud-archive/stein
Status: Fix Committed => Fix Released
** Changed in: cloud-archive/train
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821755
Title:
[SRU] live migration break the anti-affinity policy of server group
simultaneously
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive stein series:
Fix Released
Status in Ubuntu Cloud Archive train series:
Fix Released
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) train series:
Fix Committed
Status in OpenStack Compute (nova) ussuri series:
Fix Released
Status in OpenStack Compute (nova) victoria series:
Fix Released
Status in OpenStack Compute (nova) wallaby series:
Fix Released
Bug description:
--------------------------------
NOTE: SRU template at the bottom
--------------------------------
Description
===========
If we live migrate two instances simultaneously, the resulting placement can break the instance group policy.
Steps to reproduce
==================
An OpenStack environment with three compute nodes (node1, node2 and node3). We create two VMs (vm1, vm2) with the anti-affinity policy.
Finally, we live migrate both VMs simultaneously.
Before live-migration, the VMs are located as follows:
node1 -> vm1
node2 -> vm2
node3
* nova live-migration vm1
* nova live-migration vm2
Expected result
===============
Fail to live migrate vm1 and vm2.
Actual result
=============
node1
node2
node3 -> vm1,vm2
Environment
===========
master branch of openstack
As described above, live migration does not take in-progress live
migrations into account and selects the destination host through the
scheduler filters alone, so both VMs are migrated to the same host.
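The race described above can be illustrated with a toy model. This is only a sketch and not nova code: the function pick_destination and the placement dictionary are hypothetical names, and the "filter" simply rejects hosts that already hold another member of the group according to the placement known at scheduling time.

```python
# Toy model of the race; not nova code. All names here are hypothetical.
def pick_destination(vm, placement, group, hosts):
    """Return the first host (other than the source) that holds no other
    member of the group, judging only by the placement passed in."""
    source = placement[vm]
    for host in hosts:
        if host == source:
            continue
        occupied = any(m != vm and placement[m] == host for m in group)
        if not occupied:
            return host
    return None


hosts = ["node1", "node2", "node3"]
group = ["vm1", "vm2"]
placement = {"vm1": "node1", "vm2": "node2"}

# Concurrent case: both requests are scheduled against the same snapshot,
# before either migration has updated the placement.
concurrent_vm1 = pick_destination("vm1", placement, group, hosts)
concurrent_vm2 = pick_destination("vm2", placement, group, hosts)

# Serialized case: vm1's move is recorded before vm2 is scheduled.
after_vm1 = dict(placement, vm1=concurrent_vm1)
serialized_vm2 = pick_destination("vm2", after_vm1, group, hosts)
```

In the concurrent case both VMs are sent to node3, which matches the actual result reported above. In the serialized case the second request sees vm1 on node3 and picks a different host.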
----------------------------------------------------
===============
SRU Description
===============
[Impact]
When performing multiple live migrations, cold migrations or resizes
simultaneously, the affinity or anti-affinity policy can be violated,
allowing a migrated VM to land on a host that conflicts with the
policy.
[Test case]
1. Setting up the env
1a. Deploy env with 5 compute nodes
1b. Confirm that all nodes have the same CPU architecture (so
live-migration works between them) either by running lscpu or
"openstack hypervisor show <node>" on each of the nodes
1c. Create anti-affinity policy
openstack server group create anti-aff --policy anti-affinity
1d. Create flavor
openstack flavor create --vcpus 1 --ram 1024 --disk 0 --id 100 test-flavor
1e. Create volumes
openstack volume create --image cirros --size 1 vol1
openstack volume create --source vol1 --size 1 vol2 && openstack volume create --source vol1 --size 1 vol3
2. Prepare to reproduce the bug
2a. Get group ID
GROUP_ID=$(openstack server group show anti-aff -c id -f value)
2b. Create VMs
openstack server create --network private --volume vol1 --flavor 100 --hint group=$GROUP_ID ins1 &&
openstack server create --network private --volume vol2 --flavor 100 --hint group=$GROUP_ID ins2 &&
openstack server create --network private --volume vol3 --flavor 100 --hint group=$GROUP_ID ins3
2c. Confirm each VM is on a different host by running "openstack
server list --long" and take note of the hosts
3. Reproducing the bug (Live migration)
3a. Perform the steps in (2) if not already done.
3b. openstack server migrate ins1 --live-migration & openstack server migrate ins2 --live-migration & openstack server migrate ins3 --live-migration
3c. watch "openstack server list --long" until all migrations are
finished
3d. Confirm that at least one VM is on the same host as another VM.
Otherwise, repeat steps 3a - 3c.
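The "same host" check in steps 3d and 4d can be sketched as a small helper. This is an illustrative sketch only: the function name colocated is hypothetical, the host names are placeholders, and the VM-to-host mapping is assumed to have been copied by hand from the "openstack server list --long" output.

```python
from collections import defaultdict


def colocated(hosts_by_vm):
    """Return {host: [vm, ...]} for every host that holds more than one VM."""
    vms_by_host = defaultdict(list)
    for vm, host in hosts_by_vm.items():
        vms_by_host[host].append(vm)
    return {
        host: sorted(vms)
        for host, vms in vms_by_host.items()
        if len(vms) > 1
    }


# Placeholder host names, standing in for the values noted in step 2c.
bug_reproduced = colocated({"ins1": "hostA", "ins2": "hostA", "ins3": "hostB"})
policy_respected = colocated({"ins1": "hostA", "ins2": "hostB", "ins3": "hostC"})
```

A non-empty result means the anti-affinity policy was violated, i.e. the bug was reproduced. An empty result means every VM in the group is on its own host.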
4. Reproducing the bug (Cold Migration)
4a. Perform the steps in (2) if not already done.
4b. openstack server migrate ins1 & openstack server migrate ins2 &
openstack server migrate ins3
4c. watch "openstack server list --long" until all statuses are
"VERIFY_RESIZE"
4d. Confirm that at least one VM is on the same host as another VM.
Otherwise, repeat steps 4a - 4c.
4e. Confirm all the resizes by running "openstack server resize confirm
<vm>"
5a. Install package that contains the fixed code on all compute nodes
5b. Clean up all the VMs
6. Confirm fix (Live migration)
6a. Perform steps 3a - 3c
6b. Confirm there are no VMs on the same host and no VMs with ERROR
status.
6c. Confirm there are VMs that have ACTIVE status and did not move
hosts. Otherwise, repeat step 6a.
6d. Run "openstack server event list <vm-id>", then "openstack server
event show <vm-id> <req-id>" for the live-migration event of the VMs
assessed in step 6c. Confirm the "message" field is "error" and that
the traceback belongs to the "compute_check_can_live_migrate_destination"
or "compute_pre_live_migration" event, with result=Error, and ends in
the _do_validation function. Repeat this step to capture both events.
6e. Check the logs for messages related to the VMs assessed in step (6c), where:
- For compute_check_can_live_migrate_destination: egrep -rnIi "MigrationPreCheckError: Migration pre-check error: Failed to validate instance group policy due to.*e9ec173a-4491-4541-9bd4-951692e48c8f.*Anti-affinity instance group policy was violated" /var/log/nova
- For compute_pre_live_migration: grep -rnIi "RescheduledException_Remote: Build of instance c55889d9-6cbe-409a-b118-7b4a8d808972 was re-scheduled: Anti-affinity instance group policy was violated." /var/log/nova
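For anyone who prefers to filter collected log lines in a script rather than with grep, the same check can be sketched in Python. This is a minimal sketch under assumptions: the function name policy_violation_lines is hypothetical, the sample lines are built from the messages quoted in the grep patterns above rather than from real log output, and the match is looser than the first egrep pattern because it does not require the instance ID to appear before the message.

```python
import re

# Message text taken from the grep patterns in step 6e; case-insensitive
# to mirror the -i flag.
VIOLATION = re.compile(
    r"Anti-affinity instance group policy was violated", re.IGNORECASE
)


def policy_violation_lines(lines, instance_id):
    """Return the lines that mention instance_id and the violation message."""
    return [
        line
        for line in lines
        if instance_id in line and VIOLATION.search(line)
    ]


# Sample lines for illustration only, not real nova log output.
sample = [
    "RescheduledException_Remote: Build of instance "
    "c55889d9-6cbe-409a-b118-7b4a8d808972 was re-scheduled: "
    "Anti-affinity instance group policy was violated.",
    "unrelated line mentioning c55889d9-6cbe-409a-b118-7b4a8d808972",
]
hits = policy_violation_lines(sample, "c55889d9-6cbe-409a-b118-7b4a8d808972")
```

An empty result for a VM that stayed ACTIVE on its original host would suggest the migration was rejected for some other reason.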
7. Confirm fix (Cold migration)
7a. Perform steps 4a - 4c, while taking note of the timestamp (by
running "date") before running the migration command
7b. Confirm there are no VMs on the same host and no VMs with ERROR
status. There should be VMs with "VERIFY_RESIZE" and "ACTIVE"
statuses. If there are no ACTIVE instances, confirm the resizes and
repeat step 7a.
7c. For the ones that are ACTIVE, check the logs for error messages.
There should be a message with an error about "anti-affinity":
egrep -rnIi "3e926491-d0dc-4611-8e87-75604c67f308.*Anti-affinity instance group policy was violated" /var/log/nova
/var/log/nova/nova-compute.log:40797:2021-07-22 19:19:54.075 1692 ERROR oslo_messaging.rpc.server nova.exception.RescheduledException: Build of instance 3e926491-d0dc-4611-8e87-75604c67f308 was re-scheduled: Anti-affinity instance group policy was violated.
7d. Confirm that the log timestamp falls a few seconds after the
migration command was issued.
7e. Run "openstack server event list <vm-id>", then "openstack server
event show <vm-id> <req-id>" for the migration event. Confirm the
"message" field is "error" and the "events" field includes a final "No
Valid Host" message, with the "compute_prep_resize" event having
result=Error and its traceback ending in the _do_validation function.
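The timestamp comparison in step 7d can be sketched as follows. This is only an illustration under assumptions: the function name seconds_after is hypothetical, the command time is a made-up value standing in for the one noted in step 7a, both clocks are assumed to be in the same timezone, and the log line is the example quoted in step 7c (shortened).

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the example log line in step 7c.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def seconds_after(log_line, issued_at):
    """Return how many seconds after issued_at the log line was written."""
    match = TIMESTAMP.search(log_line)
    if match is None:
        raise ValueError("no timestamp found in log line")
    logged_at = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S.%f")
    return (logged_at - issued_at).total_seconds()


log_line = (
    "/var/log/nova/nova-compute.log:40797:2021-07-22 19:19:54.075 1692 "
    "ERROR oslo_messaging.rpc.server nova.exception.RescheduledException"
)
# Hypothetical time at which the migration command was issued (step 7a).
issued_at = datetime(2021, 7, 22, 19, 19, 50)
delay = seconds_after(log_line, issued_at)
```

A small positive delay is consistent with the error having been raised by the migration under test. What counts as "a few seconds" is left to the tester's judgement.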
[Regression Potential]
Part of the new code path has been tested in upstream CI in happy
migration paths. Concurrency has not been tested in the CI to trigger
the error in a negative test. The exception handling code is executed
only in case the exception is raised (in case of policy violation), so
this code path is being tested manually as part of the upstream patch
work and SRU.
[Other Info]
None
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821755/+subscriptions