[Bug 1945983] [NEW] Instances with trunk ports attached taking a long time to delete

 

Public bug reported:

Description
===========

The initial request to destroy the instance
(45adbb55-491d-418b-ba68-7db43d1c235b / instance-0000000d), which had
been live migrated to the host, arrives in n-cpu here:

https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt

 9916 Oct 01 11:10:33.162056 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "45adbb55-491d-418b-ba68-7db43d1c235b" acquired by "nova.compute.manager.ComputeManager.terminate_instance.<locals>.do_terminate_instance" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
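
For context, the per-instance lock seen above comes from oslo.concurrency:
nova serializes operations on a single instance by locking on its UUID. A
minimal sketch of the pattern (illustrative only, not nova's actual code):

    from oslo_concurrency import lockutils

    def terminate_instance(instance_uuid):
        # Serialize all operations on this instance by locking on its UUID,
        # mirroring what do_terminate_instance does in the log above.
        @lockutils.synchronized(instance_uuid)
        def do_terminate_instance():
            # ... power off the guest, unplug VIFs, disconnect volumes ...
            pass
        do_terminate_instance()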

In libvirtd we see the worker thread attempt and fail to terminate the
QEMU process, which appears to be busy:

https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/libvirt/libvirt/libvirtd_log.txt

77557 2021-10-01 11:10:33.173+0000: 57258: debug : virThreadJobSet:93 : Thread 57258 (virNetServerHandleJob) is now running job remoteDispatchDomainDestroy
77558 2021-10-01 11:10:33.173+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x1
77559 2021-10-01 11:10:33.173+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
[..]
77673 2021-10-01 11:10:43.180+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
[..]
77674 2021-10-01 11:11:13.202+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
77675 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=destroy agentJob=none asyncJob=none (vm=0x7f93400f98c0 name=instance-0000000d, current job=none agentJob=      none async=none)
77676 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9470 : Started job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
77677 2021-10-01 11:11:13.203+0000: 57258: debug : qemuProcessStop:7279 : Shutting down vm=0x7f93400f98c0 name=instance-0000000d id=14 pid=73535, reason=destroyed, asyncJob=none, flags=0x0
77678 2021-10-01 11:11:13.203+0000: 57258: debug : qemuDomainLogAppendMessage:10691 : Append log message (vm='instance-0000000d' message='2021-10-01 11:11:13.203+0000: shutting down, reason=destroyed
77679 ) stdioLogD=1                                                                   
77680 2021-10-01 11:11:13.204+0000: 57258: info : qemuMonitorClose:916 : QEMU_MONITOR_CLOSE: mon=0x7f93400ecce0 refs=4
77681 2021-10-01 11:11:13.205+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x5
77682 2021-10-01 11:11:13.205+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0 
77683 2021-10-01 11:11:23.213+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
[..]
77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
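
The escalation seen here (SIGTERM, wait ~10 seconds, SIGKILL, wait, then
give up with EBUSY) is implemented in libvirt's
virProcessKillPainfullyDelay. As a rough illustration only (timeouts
approximated from the log, not taken from libvirt's source), the logic is
equivalent to something like:

    import errno
    import os
    import signal
    import time

    def _alive(pid):
        try:
            os.kill(pid, 0)  # signal 0 only checks the process exists
            return True
        except ProcessLookupError:
            return False

    def kill_painfully(pid, term_wait=10, kill_wait=30):
        os.kill(pid, signal.SIGTERM)
        deadline = time.time() + term_wait
        while time.time() < deadline:
            if not _alive(pid):
                return
            time.sleep(0.2)
        # "Timed out waiting after SIGTERM ... sending SIGKILL"
        os.kill(pid, signal.SIGKILL)
        deadline = time.time() + kill_wait
        while time.time() < deadline:
            if not _alive(pid):
                return
            time.sleep(0.2)
        raise OSError(errno.EBUSY,
                      "Failed to terminate process %d with SIGKILL" % pid)

Note that the whole sequence above runs in the same virNetServerHandleJob
worker thread (57258), so each failed destroy ties that thread up for the
full timeout.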

Towards the end, the following virNetDevTapDelete failure gives us an
idea of why this could be happening:

77689 2021-10-01 11:11:53.241+0000: 57258: error : virNetDevTapDelete:339 : Unable to associate TAP device: Device or resource busy
77690 2021-10-01 11:11:53.241+0000: 57258: debug : virSystemdTerminateMachine:488 : Attempting to terminate machine via systemd
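
The EBUSY from virNetDevTapDelete suggests something still has a
reference to the tap device at this point. A hypothetical diagnostic
(device name taken from the os-vif entry quoted further down,
vif_name='tapea034ac1-f8') to see what is still holding it:

    import subprocess

    TAP = "tapea034ac1-f8"  # taken from the VIF in the n-cpu log below

    # Kernel view of the device, including any bridge it is enslaved to.
    print(subprocess.run(["ip", "-o", "link", "show", TAP],
                         capture_output=True, text=True).stdout)

    # Which OVS bridge, if any, still lists the port.
    print(subprocess.run(["ovs-vsctl", "port-to-br", TAP],
                         capture_output=True, text=True).stdout)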

libvirtd then attempts to kill the domain via systemd. This eventually
completes, but in the meantime the attached volume has disappeared,
possibly because Tempest cleaned it up in the background:

77955 2021-10-01 11:11:54.266+0000: 57258: debug : virCgroupV1Remove:699 : Done removing cgroup /machine.slice/machine-qemu\x2d14\x2dinstance\x2d0000000d.scope
77956 2021-10-01 11:11:54.266+0000: 57258: warning : qemuProcessStop:7488 : Failed to remove cgroup for instance-0000000d
77957 2021-10-01 11:12:30.391+0000: 57258: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address
77958 2021-10-01 11:12:30.391+0000: 57258: warning : qemuBlockRemoveImageMetadata:2629 : Unable to remove disk metadata on vm instance-0000000d from /dev/sdb (disk target vda)
77959 2021-10-01 11:12:30.392+0000: 57258: debug : virObjectEventNew:631 : obj=0x7f934c087670
77960 2021-10-01 11:12:30.392+0000: 57258: debug : qemuDomainObjEndJob:9746 : Stopping job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
[.]
77962 2021-10-01 11:12:30.393+0000: 57258: debug : virThreadJobClear:118 : Thread 57258 (virNetServerHandleJob) finished job remoteDispatchDomainDestroy with ret=0
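
The systemd fallback logged by virSystemdTerminateMachine asks
systemd-machined to stop the guest's scope. For illustration only, roughly
the same thing done by hand (machine name decoded from the cgroup path
machine-qemu\x2d14\x2dinstance\x2d0000000d.scope above):

    import subprocess

    # Equivalent manual step, illustrative only.
    subprocess.run(["machinectl", "terminate", "qemu-14-instance-0000000d"],
                   check=True)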

Back in n-cpu we see the vif unplug correctly:

https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt

 9990 Oct 01 11:12:30.475937 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO os_vif [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Successfully unplugged vif VIFOpenVSwitch(active=True,address=fa:16:3e:68:e0:63,bridge_name='br-int',has_traffic_filtering=True,id=ea034ac1-f87f-4319-8c17-a9748af7990d,network=Network(cff81725-6326-4be1-97f5-116b0b60d276),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapea034ac1-f8')
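
For reference, the unplug above goes through the os-vif library's 'ovs'
plugin. A minimal sketch of the call (fields copied from the VIF in the
log, object construction abbreviated; this is illustrative, not nova's
exact code):

    import os_vif
    from os_vif.objects import instance_info, vif as vif_obj

    os_vif.initialize()

    vif = vif_obj.VIFOpenVSwitch(
        id='ea034ac1-f87f-4319-8c17-a9748af7990d',
        address='fa:16:3e:68:e0:63',
        bridge_name='br-int',
        vif_name='tapea034ac1-f8',
        plugin='ovs')
    inst = instance_info.InstanceInfo(
        uuid='45adbb55-491d-418b-ba68-7db43d1c235b',
        name='instance-0000000d')

    os_vif.unplug(vif, inst)  # delegates to the 'ovs' plugin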


We then encounter issues disconnecting the attached volume from the instance:

 9993 Oct 01 11:12:30.479947 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "connect_volume" acquired by "os_brick.initiator.connectors.iscsi.ISCSIConnector.disconnect_volume" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-      packages/oslo_concurrency/lockutils.py:355}}
[..]
10122 Oct 01 11:13:30.641380 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: WARNING nova.virt.libvirt.driver [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Ignoring Volume Error on vol ba95db3d-c2ff-40c6-ba7c-563c3070c85d during delete Unexpected error while running command.
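
The disconnect_volume that stalls here is os-brick's iSCSI connector. The
shape of the call is roughly the following (the connection_properties dict
really comes from Cinder's initialize_connection API; the values below are
placeholders, shown only to illustrate the path nova is blocked on):

    from os_brick.initiator import connector

    conn = connector.InitiatorConnector.factory(
        'ISCSI', root_helper='sudo', use_multipath=False)

    connection_properties = {
        # Placeholder values; the real ones describe volume
        # ba95db3d-c2ff-40c6-ba7c-563c3070c85d from the warning above.
        'target_portal': '203.0.113.10:3260',
        'target_iqn': 'iqn.2010-10.org.openstack:volume-placeholder',
        'target_lun': 1,
    }
    conn.disconnect_volume(connection_properties, device_info=None)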

There is then another ~60 second gap before the request finally
completes:

10135 Oct 01 11:14:33.313929 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO nova.compute.manager [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Took 240.14 seconds to destroy the instance on the hypervisor.
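
Breaking that 240 seconds down from the timestamps above: roughly 80s is
spent in the two failed SIGTERM/SIGKILL cycles inside libvirtd (11:10:33
to 11:11:53), ~37s elapse between the systemd fallback and the destroy
job finishing after the failed /dev/sdb metadata removal (11:11:53 to
11:12:30), ~60s go to the os-brick disconnect_volume attempt (11:12:30 to
11:13:30), and the final ~63s gap takes us to 11:14:33.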


Steps to reproduce
==================

LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
live migration test with trunk ports fails during cleanup.
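
For anyone trying to reproduce this locally, the single test can be
selected with the tempest CLI, e.g. "tempest run --regex
test_live_migration_with_trunk" from the tempest virtualenv (the exact
module path of the test may vary between branches).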

Expected result
===============

Both the test and cleanup pass without impacting libvirtd.

Actual result
=============

The test passes, but cleanup locks up the single thread handling the
libvirtd event loop in libvirt 6.0.0.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   stable/xena and master

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   libvirt (6.0.0) and QEMU

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

    Trunk ports.

Logs & Configs
==============

Initially discovered and discussed as part of
https://bugs.launchpad.net/nova/+bug/1912310, where the lock-up within
libvirtd caused other tests to fail.

** Affects: nova
     Importance: Critical
         Status: New


** Tags: libvirt neutron
