
yahoo-eng-team team mailing list archive

[Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures

 

** Also affects: devstack
   Importance: Undecided
       Status: New

** Description changed:

  Description
  ===========
  
- The initial request to destroy the instance
- (45adbb55-491d-418b-ba68-7db43d1c235b / instance-0000000d) that had been
- live migrated to the host comes into n-cpu here:
- 
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt
- 
-  9916 Oct 01 11:10:33.162056 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "45adbb55-491d-418b-ba68-7db43d1c235b" acquired by "nova.compute.manager.ComputeManager.terminate_instance.<locals>.do_terminate_instance" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
- 
- In libvirtd we see the thread attempt and fail to terminate the process
- that appears busy:
- 
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/libvirt/libvirt/libvirtd_log.txt
- 
- 77557 2021-10-01 11:10:33.173+0000: 57258: debug : virThreadJobSet:93 : Thread 57258 (virNetServerHandleJob) is now running job remoteDispatchDomainDestroy
- 77558 2021-10-01 11:10:33.173+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x1
- 77559 2021-10-01 11:10:33.173+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- [..]
- 77673 2021-10-01 11:10:43.180+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77674 2021-10-01 11:11:13.202+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
- 77675 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=destroy agentJob=none asyncJob=none (vm=0x7f93400f98c0 name=instance-0000000d, current job=none agentJob=none async=none)
- 77676 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9470 : Started job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
- 77677 2021-10-01 11:11:13.203+0000: 57258: debug : qemuProcessStop:7279 : Shutting down vm=0x7f93400f98c0 name=instance-0000000d id=14 pid=73535, reason=destroyed, asyncJob=none, flags=0x0
- 77678 2021-10-01 11:11:13.203+0000: 57258: debug : qemuDomainLogAppendMessage:10691 : Append log message (vm='instance-0000000d' message='2021-10-01 11:11:13.203+0000: shutting down, reason=destroyed
- 77679 ) stdioLogD=1                                                                   
- 77680 2021-10-01 11:11:13.204+0000: 57258: info : qemuMonitorClose:916 : QEMU_MONITOR_CLOSE: mon=0x7f93400ecce0 refs=4
- 77681 2021-10-01 11:11:13.205+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x5
- 77682 2021-10-01 11:11:13.205+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0 
- 77683 2021-10-01 11:11:23.213+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
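- 
- For illustration, the escalate-and-wait pattern that
- virProcessKillPainfullyDelay follows looks roughly like the sketch
- below (a simplified Python approximation, not libvirt's actual C
- code). The key point is that a PID which is still present after
- SIGKILL is almost always stuck in uninterruptible kernel I/O, which
- is why libvirt eventually gives up and reports the process as busy:
- 
- import errno
- import os
- import signal
- import time
- 
- def kill_painfully(pid, term_wait=10.0, kill_wait=30.0):
-     """Simplified sketch of SIGTERM-then-SIGKILL escalation."""
-     os.kill(pid, signal.SIGTERM)
-     if _wait_for_exit(pid, term_wait):
-         return True
-     # "Timed out waiting after SIGTERM ... sending SIGKILL" in the log
-     os.kill(pid, signal.SIGKILL)
-     if _wait_for_exit(pid, kill_wait):
-         return True
-     # Still alive after SIGKILL: typically blocked in uninterruptible
-     # (D state) I/O, e.g. against a no-longer-reachable iSCSI device.
-     return False
- 
- def _wait_for_exit(pid, timeout):
-     deadline = time.monotonic() + timeout
-     while time.monotonic() < deadline:
-         try:
-             os.kill(pid, 0)          # probe only, no signal delivered
-         except OSError as e:
-             if e.errno == errno.ESRCH:
-                 return True          # process is gone
-             raise
-         time.sleep(0.2)
-     return False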
- 
- Towards the end, the following virNetDevTapDelete failure gives an
- idea of why this could be happening:
- 
- 77689 2021-10-01 11:11:53.241+0000: 57258: error : virNetDevTapDelete:339 : Unable to associate TAP device: Device or resource busy
- 77690 2021-10-01 11:11:53.241+0000: 57258: debug : virSystemdTerminateMachine:488 : Attempting to terminate machine via systemd
- 
- libvirtd then attempts to kill the domain via systemd. This eventually
- completes, but in the meantime the attached volume has disappeared,
- possibly because Tempest cleaned it up in the background:
- 
- 77955 2021-10-01 11:11:54.266+0000: 57258: debug : virCgroupV1Remove:699 : Done removing cgroup /machine.slice/machine-qemu\x2d14\x2dinstance\x2d0000000d.scope
- 77956 2021-10-01 11:11:54.266+0000: 57258: warning : qemuProcessStop:7488 : Failed to remove cgroup for instance-0000000d
- 77957 2021-10-01 11:12:30.391+0000: 57258: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address
- 77958 2021-10-01 11:12:30.391+0000: 57258: warning : qemuBlockRemoveImageMetadata:2629 : Unable to remove disk metadata on vm instance-0000000d from /dev/sdb (disk target vda)
- 77959 2021-10-01 11:12:30.392+0000: 57258: debug : virObjectEventNew:631 : obj=0x7f934c087670
- 77960 2021-10-01 11:12:30.392+0000: 57258: debug : qemuDomainObjEndJob:9746 : Stopping job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
- [.]
- 77962 2021-10-01 11:12:30.393+0000: 57258: debug : virThreadJobClear:118 : Thread 57258 (virNetServerHandleJob) finished job remoteDispatchDomainDestroy with ret=0
- 
- Back in n-cpu we see the vif unplug correctly:
- 
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt
- 
-  9990 Oct 01 11:12:30.475937 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO os_vif [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Successfully unplugged vif VIFOpenVSwitch(active=True,address=fa:16:3e:68:e0:63,bridge_name='br-int',has_traffic_filtering=True,id=ea034ac1-f87f-4319-8c17-a9748af7990d,network=Network(cff81725-6326-4be1-97f5-116b0b60d276),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapea034ac1-f8')
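- 
- The unplug goes through the os-vif library. A minimal sketch of that
- call, using the values from the log line above (object fields trimmed
- for brevity, so treat it as illustrative rather than Nova's exact
- code):
- 
- import os_vif
- from os_vif.objects import instance_info, network, vif
- 
- os_vif.initialize()
- 
- # Values copied from the log line above.
- net = network.Network(id='cff81725-6326-4be1-97f5-116b0b60d276')
- port = vif.VIFOpenVSwitch(
-     id='ea034ac1-f87f-4319-8c17-a9748af7990d',
-     address='fa:16:3e:68:e0:63',
-     bridge_name='br-int',
-     vif_name='tapea034ac1-f8',
-     plugin='ovs',
-     network=net)
- info = instance_info.InstanceInfo(
-     uuid='45adbb55-491d-418b-ba68-7db43d1c235b',
-     name='instance-0000000d')
- 
- # Nova calls this while destroying the instance; on success it logs
- # the "Successfully unplugged vif ..." line seen above.
- os_vif.unplug(port, info)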
- 
- 
- We then encounter issues disconnecting the attached volume from the instance:
- 
-  9993 Oct 01 11:12:30.479947 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "connect_volume" acquired by "os_brick.initiator.connectors.iscsi.ISCSIConnector.disconnect_volume" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
- [..]
- 10122 Oct 01 11:13:30.641380 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: WARNING nova.virt.libvirt.driver [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Ignoring Volume Error on vol ba95db3d-c2ff-40c6-ba7c-563c3070c85d during delete Unexpected error while running command.
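- 
- The "connect_volume" lock in the first line comes from os-brick
- serialising volume attach and detach on the host under a single
- oslo.concurrency lock name, which is why a stuck disconnect can hold
- up later volume operations. The pattern is roughly the following (a
- generic sketch; the exact decorators in os-brick may differ):
- 
- from oslo_concurrency import lockutils
- 
- class ISCSIConnectorSketch(object):
-     # connect_volume and disconnect_volume share one lock name, which
-     # is why the log shows disconnect_volume acquiring "connect_volume".
- 
-     @lockutils.synchronized('connect_volume')
-     def connect_volume(self, connection_properties):
-         ...  # log in to the target and wait for the device to appear
- 
-     @lockutils.synchronized('connect_volume')
-     def disconnect_volume(self, connection_properties, device_info):
-         ...  # flush and log out; anything slow here blocks new attaches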
- 
- There is then another ~60-second gap until the request finally
- completes:
- 
- 10135 Oct 01 11:14:33.313929 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: INFO nova.compute.manager [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] [instance: 45adbb55-491d-418b-ba68-7db43d1c235b] Took 240.14 seconds to destroy the instance on the hypervisor.
- 
+ See c#2 for the actual issue here.
  
  Steps to reproduce
  ==================
  
  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.
  
  Expected result
  ===============
  
  Both the test and cleanup pass without impacting libvirtd.
  
  Actual result
  =============
  
  The test passes, but cleanup locks up the single thread handling the
  libvirtd event loop in libvirt 6.0.0.
  
  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/
  
     stable/xena and master
  
  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?
  
     libvirt (6.0.0) and QEMU
  
  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?
  
     N/A
  
  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)
  
      Trunk ports.
  
  Logs & Configs
  ==============
  
  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310, where the resulting lock-up
  within libvirtd causes other tests to fail.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1945983

Title:
  Duplicate iSCSI initiators causing live migration failures

Status in devstack:
  New
Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  Description
  ===========

  See c#2 for the actual issue here.

  Steps to reproduce
  ==================

  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.

  Expected result
  ===============

  Both the test and cleanup pass without impacting libvirtd.

  Actual result
  =============

  The test passes, but cleanup locks up the single thread handling the
  libvirtd event loop in libvirt 6.0.0.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/

     stable/xena and master

  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

     libvirt (6.0.0) and QEMU

  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

     N/A

  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

      Trunk ports.

  Logs & Configs
  ==============

  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310, where the resulting lock-up
  within libvirtd causes other tests to fail.

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1945983/+subscriptions


