[Bug 2053163] [NEW] VM hard reboot fails on Live Migration Abort with node having Two numa sockets
Public bug reported:
Description
===========
When a live migration is aborted, the abort path maps the new (destination) NUMA topology onto the instance even though the instance keeps running on the source host. A later hard reboot, which re-generates the domain XML, therefore uses the updated NUMA topology and points at a cell with no available resources, so the VM fails to recover.
Steps to reproduce [100%]
==================
Each compute node should have two NUMA cells (sockets).
The VM flavor has the following extra specs:
hw:mem_page_size='1048576', hw:numa_nodes='1'
The flavor needs 100 huge pages.
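As a reference, a flavor with these properties could be created as follows (the flavor name and disk size are placeholders; the vCPU and RAM values match the instance NUMA topology shown further below):
# flavor name "numa-1g-test" and --disk 40 are placeholders
openstack flavor create --vcpus 16 --ram 81920 --disk 40 \
  --property hw:mem_page_size='1048576' \
  --property hw:numa_nodes='1' \
  numa-1g-test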
Before performing the test, make sure the source and destination have the huge page resources shown below.
The VM will be moved from the source host (NUMA node 1 on compute1) to NUMA node 0 on compute2.
Source: [compute1]
~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 50
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50 <source cell: the test VM runs on NUMA node 1>
Node 1 HugePages_Surp: 0
Destination: [compute-2]
~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages: 28672 kB
Node 0 ShmemHugePages: 0 kB
Node 0 FileHugePages: 0 kB
Node 0 HugePages_Total: 210
Node 0 HugePages_Free: 130 <destination node has 130 free huge pages>
Node 0 HugePages_Surp: 0
Node 1 AnonHugePages: 61440 kB
Node 1 ShmemHugePages: 0 kB
Node 1 FileHugePages: 0 kB
Node 1 HugePages_Total: 210
Node 1 HugePages_Free: 50
Node 1 HugePages_Surp: 0
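The same per-cell huge page view, and the cell the guest memory is actually bound to, can also be checked through libvirt (the domain name below is a placeholder for the instance's libvirt name):
~# virsh freepages --all
~# virsh dumpxml <instance-domain> | grep -E -A 4 'numatune|hugepages'   # <instance-domain> is a placeholder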
NUMA topology details before the live migration:
MariaDB [nova]> select numa_topology from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
| numa_topology
|
| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["pagesize", "id"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]} |
MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
<empty>
-----END of DB-----
# Trigger the live migration
# Apply stress inside the VM so that the migration takes long enough to be aborted
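For example (assuming a recent python-openstackclient and stress-ng installed in the guest; the stress parameters are only illustrative):
openstack server migrate --live-migration --host compute2 4b115eb3-59f7-4e27-b877-2e326ef017b3
# inside the guest, keep dirtying memory so the migration keeps iterating:
stress-ng --vm 4 --vm-bytes 75% --timeout 600s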
A migration context has now been created for the VM:
MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| migration_context |
| {"nova_object.name": "MigrationContext", "nova_object.namespace":
"nova", "nova_object.version": "1.2", "nova_object.data":
{"instance_uuid": "4b115eb3-59f7-4e27-b877-2e326ef017b3",
"migration_id": 283, "new_numa_topology": {"nova_object.name":
"InstanceNUMATopology", "nova_object.namespace": "nova",
"nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["cpuset_reserved",
"id", "pcpuset", "pagesize", "cpu_pinning_raw", "cpu_policy",
"cpu_thread_policy", "memory", "cpuset"]}], "emulator_threads_policy":
null}, "nova_object.changes": ["emulator_threads_policy", "cells"]},
"old_numa_topology": {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.6",
"nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null,
"memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null,
"cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes":
["id", "pagesize"]}], "emulator_threads_policy": null},
"nova_object.changes": ["emulator_threads_policy", "cells"]}
The old NUMA cell is 1; the new NUMA cell is 0.
# Trigger the abort
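For example, with a recent OpenStack client (the abort call needs compute API microversion 2.24 or later; migration ID 283 is the one shown in the migration context above):
openstack server migration list --server 4b115eb3-59f7-4e27-b877-2e326ef017b3
openstack --os-compute-api-version 2.24 server migration abort 4b115eb3-59f7-4e27-b877-2e326ef017b3 283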
Feb 13 20:59:00 cdc-appblx095-36 nova-compute[638201]: 2024-02-13
20:59:00.991 638201 ERROR nova.virt.libvirt.driver [None
req-05850c05-ba5b-40ae-a37c-5ccdde8ded47
4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - -
default default] [instance: 4b115eb3-59f7-4e27-b877-2e326ef017b3]
Migration operation has aborted
After the abort, the instance's NUMA topology was updated to NUMA cell 0, which belongs to the destination host:
| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["cpu_thread_policy",
"cpuset_reserved", "cpu_pinning_raw", "cpuset", "cpu_policy", "memory",
"pagesize", "pcpuset", "id"]}], "emulator_threads_policy": null},
"nova_object.changes": ["emulator_threads_policy", "cells"]} |
The migration context is also not deleted.
Expected result
===============
On abort, the VM's NUMA topology should be rolled back to its original state. Because it is not, a subsequent hard reboot of the VM fails, since no resources are available on the NUMA cell recorded in the updated topology (see the reboot command below).
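The failing hard reboot can then be reproduced with, for example:
openstack server reboot --hard 4b115eb3-59f7-4e27-b877-2e326ef017b3
Because the stored topology now points at host NUMA cell 0, which does not have enough free 1 GiB pages on the source host, the reboot fails as described above.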
Actual result
=============
After the abort, the VM retains the new NUMA topology calculated for the destination host.
Environment
===========
OpenStack Antelope release on Ubuntu; kernel 6.5.0-15-generic #15~22.04.1-Ubuntu SMP
PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
** Affects: nova
Importance: Undecided
Assignee: keerthivasan (keerthivassan86)
Status: New
** Changed in: nova
Assignee: (unassigned) => keerthivasan (keerthivassan86)
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2053163
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2053163/+subscriptions