yahoo-eng-team team mailing list archive

[Bug 2053163] [NEW] VM hard reboot fails after live migration abort on a node with two NUMA sockets

 

Public bug reported:

Description
===========
When a live migration is aborted, the new NUMA topology calculated for the destination is applied to the instance, even though the instance keeps running on the source. A later hard reboot, which regenerates the domain XML, then uses this updated topology; its NUMA cell has no free resources on the source, so the VM fails to recover.


Steps to reproduce [100%]
=========================


Each compute node should have two NUMA cells (sockets).

The VM flavor has the following extra specs:

hw:mem_page_size='1048576', hw:numa_nodes='1'

We need 100 huge pages for this flavor.
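
For reference, such a flavor can be created with the openstack client; the flavor name and disk size below are illustrative, not from the original setup (the RAM and vCPU counts match the instance topology shown later):

~# openstack flavor create --ram 81920 --vcpus 16 --disk 40 numa-1g-test
~# openstack flavor set numa-1g-test --property hw:mem_page_size=1048576 --property hw:numa_nodes=1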


Before running the test, make sure the source and destination have the huge page resources shown below.

The VM will be moved from NUMA node 1 on the source to NUMA node 0 on
compute2.

Source: [compute1]

~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:     28672 kB
Node 0 ShmemHugePages:        0 kB
Node 0 FileHugePages:        0 kB
Node 0 HugePages_Total:   210
Node 0 HugePages_Free:    50                       
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:     61440 kB
Node 1 ShmemHugePages:        0 kB
Node 1 FileHugePages:        0 kB
Node 1 HugePages_Total:   210
Node 1 HugePages_Free:     50                <source node: the test VM is placed on NUMA node 1>
Node 1 HugePages_Surp:      0


Destination: [compute-2]

~# cat /sys/devices/system/node/node*/meminfo | grep -i huge
Node 0 AnonHugePages:     28672 kB
Node 0 ShmemHugePages:        0 kB
Node 0 FileHugePages:        0 kB
Node 0 HugePages_Total:   210
Node 0 HugePages_Free:    130                <destination node 0 has 130 free huge pages>
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:     61440 kB
Node 1 ShmemHugePages:        0 kB
Node 1 FileHugePages:        0 kB
Node 1 HugePages_Total:   210
Node 1 HugePages_Free:     50
Node 1 HugePages_Surp:      0
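
The same counters can also be read per page size directly from sysfs, which is handy for watching the 1 GiB pool during the test:

~# cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/free_hugepages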


Before the live migration, the instance NUMA topology looks like this:


MariaDB [nova]> select numa_topology from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';

| numa_topology |

| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 1,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["pagesize", "id"]}],
"emulator_threads_policy": null}, "nova_object.changes":
["emulator_threads_policy", "cells"]} |

MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
    <empty>
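
The JSON blobs in these columns are easier to read pretty-printed; on MariaDB 10.2+ this can be done in the query itself:

MariaDB [nova]> select json_detailed(numa_topology) from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';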

-----END of DB-----

# Trigger live migration

# Apply stress inside the VM so the migration runs long enough to be aborted
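
A rough sketch of these two steps (the stress-ng parameters are illustrative, and the exact openstack client syntax varies by release):

~# openstack server migrate --live-migration --host compute2 4b115eb3-59f7-4e27-b877-2e326ef017b3
# inside the guest, keep memory dirty so the migration does not converge quickly:
$ stress-ng --vm 8 --vm-bytes 75% --timeout 600s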

A migration context is now created for this VM:

MariaDB [nova]> select migration_context from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';
+-------------------+
| migration_context |

| {"nova_object.name": "MigrationContext", "nova_object.namespace":
"nova", "nova_object.version": "1.2", "nova_object.data":
{"instance_uuid": "4b115eb3-59f7-4e27-b877-2e326ef017b3",
"migration_id": 283, "new_numa_topology": {"nova_object.name":
"InstanceNUMATopology", "nova_object.namespace": "nova",
"nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["cpuset_reserved",
"id", "pcpuset", "pagesize", "cpu_pinning_raw", "cpu_policy",
"cpu_thread_policy", "memory", "cpuset"]}], "emulator_threads_policy":
null}, "nova_object.changes": ["emulator_threads_policy", "cells"]},
"old_numa_topology": {"nova_object.name": "InstanceNUMATopology",
"nova_object.namespace": "nova", "nova_object.version": "1.3",
"nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell",
"nova_object.namespace": "nova", "nova_object.version": "1.6",
"nova_object.data": {"id": 1, "cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15], "pcpuset": [], "cpuset_reserved": null,
"memory": 81920, "pagesize": 1048576, "cpu_pinning_raw": null,
"cpu_policy": null, "cpu_thread_policy": null}, "nova_object.changes":
["id", "pagesize"]}], "emulator_threads_policy": null},
"nova_object.changes": ["emulator_threads_policy", "cells"]}

The old NUMA cell is 1; the new NUMA cell is 0.

# Trigger abort
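
The abort was issued against the in-progress migration, along these lines (migration id 283 comes from the migration context above; `openstack server migration abort` needs compute API microversion 2.24 or later):

~# openstack server migration list --server 4b115eb3-59f7-4e27-b877-2e326ef017b3
~# openstack server migration abort 4b115eb3-59f7-4e27-b877-2e326ef017b3 283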

Feb 13 20:59:00 cdc-appblx095-36 nova-compute[638201]: 2024-02-13
20:59:00.991 638201 ERROR nova.virt.libvirt.driver [None
req-05850c05-ba5b-40ae-a37c-5ccdde8ded47
4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - -
default default] [instance: 4b115eb3-59f7-4e27-b877-2e326ef017b3]
Migration operation has aborted

After the abort, the instance NUMA topology was updated to NUMA cell 0,
which belongs to the destination:

| {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace":
"nova", "nova_object.version": "1.3", "nova_object.data": {"cells":
[{"nova_object.name": "InstanceNUMACell", "nova_object.namespace":
"nova", "nova_object.version": "1.6", "nova_object.data": {"id": 0,
"cpuset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
"pcpuset": [], "cpuset_reserved": null, "memory": 81920, "pagesize":
1048576, "cpu_pinning_raw": null, "cpu_policy": null,
"cpu_thread_policy": null}, "nova_object.changes": ["cpu_thread_policy",
"cpuset_reserved", "cpu_pinning_raw", "cpuset", "cpu_policy", "memory",
"pagesize", "pcpuset", "id"]}], "emulator_threads_policy": null},
"nova_object.changes": ["emulator_threads_policy", "cells"]} |

The migration context is not deleted.
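
Both symptoms can be confirmed against the same row (the string match assumes the JSON formatting shown above):

MariaDB [nova]> select numa_topology like '%"id": 0%' as on_cell_0, migration_context is not null as ctx_present from instance_extra where instance_uuid='4b115eb3-59f7-4e27-b877-2e326ef017b3';

Finally, hard reboot the VM to hit the failure; the regenerated XML requests 1 GiB pages on cell 0 of the source, where too few are free:

~# openstack server reboot --hard 4b115eb3-59f7-4e27-b877-2e326ef017b3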


Expected result
===============
On abort, the instance NUMA topology should be rolled back to its original state, so that a subsequent hard reboot of the VM succeeds.

Actual result
=============
Post abort, the VM keeps the new NUMA topology calculated for the destination; a subsequent hard reboot fails because that NUMA cell has no free resources on the source.

Environment
===========

OpenStack Antelope (2023.1) on Ubuntu 22.04, kernel 6.5.0-15-generic
#15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2024 x86_64 GNU/Linux

** Affects: nova
     Importance: Undecided
     Assignee: keerthivasan (keerthivassan86)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => keerthivasan (keerthivassan86)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2053163

Title:
  VM hard reboot fails after live migration abort on a node with two
  NUMA sockets

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2053163/+subscriptions